Data Provider Toolkit

class PreprocessedFieldMapping(tags: list[TagName], preprocessors: list[Callable])

Bases: object

preprocessors: list[Callable]

tags: list[TagName]

class DataProviderFieldPreprocessors

Bases: object

static cast_datetime_to_date(column: DataColumn) → DataColumn

Cast datetime values to date type.

Converts a column containing datetime values to date32 type, discarding time information.

Parameters: column – Column containing datetime values
Returns: Column with values cast to date type
Return type: DataColumn
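
As a rough illustration of the truncation described above, the following sketch drops the time-of-day component from a list of values. It uses plain Python lists and the standard datetime module in place of DataColumn and date32, and the helper name is hypothetical; passing nulls through unchanged is an assumption rather than documented behavior.

```python
from __future__ import annotations

from datetime import date, datetime


def cast_datetimes_to_dates(values: list[datetime | None]) -> list[date | None]:
    """Keep only the calendar date of each value; None entries stay None (assumed)."""
    return [None if value is None else value.date() for value in values]


# Plain lists stand in for a DataColumn here.
sample = [datetime(2024, 3, 15, 16, 30), None, datetime(2024, 3, 18)]
truncated = cast_datetimes_to_dates(sample)
```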

static convert_millions_to_units(column: DataColumn) → DataColumn

Convert financial values from millions to individual units.

Takes a column containing values expressed in millions and multiplies each value by 1,000,000 to convert to standard units.

Parameters: column – Column containing values in millions
Returns: Column with values converted to standard units
Return type: DataColumn
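
The scaling itself is simple arithmetic, which the sketch below shows on a plain list of Decimal values standing in for a DataColumn. The function name and the pass-through of null values are assumptions made for illustration only.

```python
from __future__ import annotations

from decimal import Decimal

UNITS_PER_MILLION = 1_000_000


def convert_millions_to_units_sketch(
    values: list[Decimal | None],
) -> list[Decimal | None]:
    """Multiply each value by 1,000,000; None entries stay None (assumed)."""
    return [None if value is None else value * UNITS_PER_MILLION for value in values]


# Hypothetical revenue figures reported in millions.
revenue_in_millions = [Decimal("1.5"), None, Decimal("2300")]
revenue_in_units = convert_millions_to_units_sketch(revenue_in_millions)
```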

class DataProviderToolkit

Bases: object

DISCREPANCY_TABLE_SEPARATOR_CHARACTER: ClassVar[str] = '-'

DISCREPANCY_TABLE_SEPARATOR_MAX_WIDTH: ClassVar[int] = 80

static consolidate_processed_endpoint_tables(*, processed_endpoint_tables: ProcessedEndpointTables, table_merge_fields: list[EntityField], predominant_order_descending: bool = False) → ConsolidatedFieldsTable

Consolidate multiple endpoint tables into a single unified table.

Merges processed tables from different endpoints by their primary keys, preserving row order and coalescing values from different endpoints. Validates that common columns across endpoints have consistent values for shared rows.

Parameters:

processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
table_merge_fields – List of entity fields to use as primary keys for merging
predominant_order_descending – Whether the predominant ordering is descending

Returns:

Consolidated table containing all data from all endpoints

Return type:

ConsolidatedFieldsTable

Raises:

DataProviderMultiEndpointCommonDataDiscrepancyError – When common columns have inconsistent values across endpoints
DataProviderMultiEndpointDuplicateKeysError – When an endpoint table has multiple rows sharing the same primary key
DataProviderMultiEndpointNullColumnsError – When any endpoint table contains a column whose every value is null
DataProviderToolkitRuntimeError – When no table contains the required primary key columns
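
The merge-and-validate behavior can be pictured with the simplified sketch below. It is not the toolkit's implementation: tables are modeled as lists of row dictionaries rather than PyArrow tables, the two exception classes are hypothetical stand-ins for the documented errors, and the all-null column check and the ordering controlled by predominant_order_descending are left out.

```python
from __future__ import annotations

Row = dict[str, object]


class DiscrepancyError(ValueError):
    """Hypothetical stand-in for the common-data discrepancy error."""


class DuplicateKeysError(ValueError):
    """Hypothetical stand-in for the duplicate-keys error."""


def consolidate(tables: dict[str, list[Row]], key_fields: list[str]) -> list[Row]:
    """Merge rows from every endpoint by primary key, coalescing non-null values."""
    merged: dict[tuple, Row] = {}
    for endpoint, rows in tables.items():
        seen_keys: set[tuple] = set()
        for row in rows:
            key = tuple(row[field] for field in key_fields)
            if key in seen_keys:
                raise DuplicateKeysError(f"{endpoint}: duplicate key {key}")
            seen_keys.add(key)
            target = merged.setdefault(key, {})
            for column, value in row.items():
                existing = target.get(column)
                # Shared columns must agree wherever both endpoints have a value.
                if existing is not None and value is not None and existing != value:
                    raise DiscrepancyError(
                        f"{column} differs for key {key}: {existing!r} vs {value!r}"
                    )
                if existing is None:
                    target[column] = value
    # Dict insertion order keeps keys in first-seen order.
    return list(merged.values())


tables = {
    "prices": [
        {"date": "2024-01-02", "close": 10.0},
        {"date": "2024-01-03", "close": 11.0},
    ],
    "volumes": [{"date": "2024-01-03", "close": 11.0, "volume": 500}],
}
consolidated = consolidate(tables, ["date"])
```

In this sketch a row that appears in only one endpoint simply lacks the other endpoint's columns; a columnar table would hold nulls in those positions instead.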

classmethod create_endpoint_tables_from_json_mapping(endpoint_json_strings: dict[Endpoint, str]) → EndpointTables

Create endpoint tables from JSON string representations.

Parses JSON strings for each endpoint and converts them into PyArrow tables, handling both JSON arrays and newline-delimited JSON formats.

Parameters: endpoint_json_strings – Dictionary mapping endpoints to their JSON string data
Returns: Dictionary mapping endpoints to parsed PyArrow tables
Return type: EndpointTables
Raises: DataProviderToolkitRuntimeError – When JSON parsing fails for any endpoint
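
One way to accept both input shapes is to branch on the first non-whitespace character, as in the sketch below. It uses only the standard json module and returns lists of dictionaries rather than PyArrow tables; the function name, the JsonParseError class, and the detection rule are assumptions, not the toolkit's actual logic.

```python
from __future__ import annotations

import json


class JsonParseError(RuntimeError):
    """Hypothetical stand-in for the toolkit's runtime error."""


def parse_json_rows(text: str) -> list[dict]:
    """Parse either a JSON array of objects or newline-delimited JSON objects."""
    stripped = text.strip()
    try:
        if stripped.startswith("["):
            # Whole payload is a single JSON array.
            return json.loads(stripped)
        # Otherwise treat every non-blank line as its own JSON document.
        return [json.loads(line) for line in stripped.splitlines() if line.strip()]
    except json.JSONDecodeError as error:
        raise JsonParseError(f"could not parse JSON payload: {error}") from error


array_rows = parse_json_rows('[{"a": 1}, {"a": 2}]')
ndjson_rows = parse_json_rows('{"a": 1}\n{"a": 2}\n')
```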

classmethod drop_discrepant_processed_endpoint_tables_rows(*, discrepancy_table: EndpointDiscrepanciesTable, processed_endpoint_tables: ProcessedEndpointTables, key_column_names: list[str]) → EndpointTables

Drop discrepant rows from processed endpoint tables.

Removes rows in each endpoint table whose primary keys match the discrepancy table, returning trimmed copies. Used when the discrepant rows cannot be reconciled and the surviving rows should be retained.

Parameters:

discrepancy_table – Table containing primary keys of discrepant rows
processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
key_column_names – List of primary key column names

Returns:

Dictionary mapping endpoints to tables with discrepant rows dropped

Return type:

EndpointTables
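
The filtering step amounts to an anti-join on the primary key columns. The sketch below shows the idea with lists of row dictionaries in place of PyArrow tables; the function name and data are hypothetical.

```python
from __future__ import annotations

Row = dict[str, object]


def drop_discrepant_rows(
    tables: dict[str, list[Row]],
    discrepant_rows: list[Row],
    key_columns: list[str],
) -> dict[str, list[Row]]:
    """Return new per-endpoint row lists without the rows whose keys are discrepant."""
    bad_keys = {tuple(row[name] for name in key_columns) for row in discrepant_rows}
    return {
        endpoint: [
            row
            for row in rows
            if tuple(row[name] for name in key_columns) not in bad_keys
        ]
        for endpoint, rows in tables.items()
    }


tables = {
    "prices": [{"date": "2024-01-02", "close": 10.0}, {"date": "2024-01-03", "close": 11.0}],
    "volumes": [{"date": "2024-01-03", "volume": 500}],
}
trimmed = drop_discrepant_rows(tables, [{"date": "2024-01-03"}], ["date"])
```

The input dictionary is left untouched, matching the description of returning trimmed copies.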

static find_common_table_missing_rows_mask(common_rows_table: Table, subset_rows_table: Table) → BooleanArray | None

Identify rows in the common table that are missing from the subset table.

Performs a null-safe comparison between two tables by column position to determine which rows in the common table are not present in the subset table.

Parameters:

common_rows_table – Table containing all potential rows
subset_rows_table – Table containing a subset of rows to check against

Returns:

Boolean mask where True indicates a missing row, or None if the common table is empty

Return type:

pyarrow.BooleanArray or None

Raises:

DataProviderToolkitArgumentError – When the tables have a different number of columns
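
A positional, null-safe comparison can be sketched with plain Python as follows. Tables are modeled as lists of columns, and None stands in for null; because None compares equal to None in Python, tuple membership gives the null-safe behavior directly. The function name and the use of ValueError are assumptions.

```python
from __future__ import annotations


def find_missing_rows_mask(
    common_columns: list[list],
    subset_columns: list[list],
) -> list[bool] | None:
    """Flag each common-table row that has no positional match in the subset table."""
    if len(common_columns) != len(subset_columns):
        raise ValueError("tables must have the same number of columns")
    # Rows are compared by column position, so column names play no role.
    common_rows = list(zip(*common_columns))
    if not common_rows:
        return None
    subset_rows = set(zip(*subset_columns))
    return [row not in subset_rows for row in common_rows]


common = [[1, 2, None], ["a", "b", "c"]]
subset = [[None, 1], ["c", "a"]]
mask = find_missing_rows_mask(common, subset)
```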

classmethod format_consolidated_discrepancy_table_for_output(*, discrepancy_table: Table, output_column_renames: list[str] | dict[str, str], csv_separator: str = '|') → str

Format a discrepancy table as a CSV string for output.

Converts a PyArrow table to CSV format with renamed columns and specified separator, preserving datetime object formatting. The CSV body is wrapped between two separator lines built from DISCREPANCY_TABLE_SEPARATOR_CHARACTER, whose width matches the header line, capped at DISCREPANCY_TABLE_SEPARATOR_MAX_WIDTH characters.

Parameters:

discrepancy_table – Table containing discrepancy data to format
output_column_renames – New column names, given as a positional list or as a mapping dictionary
csv_separator – Character to use as CSV field separator

Returns:

CSV-formatted string representation of the table, wrapped between separator lines

Return type:

str
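
The separator-line wrapping can be approximated with the standard csv module, as in the sketch below. The constants mirror the documented class variables, but the function name, the row-based input, and the exact newline handling are assumptions; column renaming and datetime formatting are omitted.

```python
from __future__ import annotations

import csv
import io

SEPARATOR_CHARACTER = "-"  # mirrors DISCREPANCY_TABLE_SEPARATOR_CHARACTER
SEPARATOR_MAX_WIDTH = 80  # mirrors DISCREPANCY_TABLE_SEPARATOR_MAX_WIDTH


def format_rows_as_delimited_text(
    header: list[str],
    rows: list[list],
    csv_separator: str = "|",
) -> str:
    """Render rows as delimited text wrapped between two separator lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=csv_separator, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    body = buffer.getvalue().rstrip("\n")
    header_line = body.split("\n", 1)[0]
    # Separator width follows the header line, capped at the maximum width.
    width = min(len(header_line), SEPARATOR_MAX_WIDTH)
    separator = SEPARATOR_CHARACTER * width
    return "\n".join([separator, body, separator])


formatted = format_rows_as_delimited_text(["date", "close"], [["2024-01-02", 10.5]])
```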

classmethod format_endpoint_discrepancy_table_for_output(*, data_block: type[BaseDataBlock], discrepancy_table: EndpointDiscrepanciesTable, endpoints_enum: StrEnum, endpoint_field_map: EndpointFieldMap, csv_separator: str = '|') → str

Format an endpoint discrepancy table with provider-specific naming.

Converts internal column names (entity.field format) to provider endpoint tag names (endpoint.tag format) and returns the result as a CSV string.

Parameters:

data_block – Data block class defining the entity structure
discrepancy_table – Table containing endpoint discrepancy data
endpoints_enum – Enum defining available endpoints
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
csv_separator – Character to use as CSV field separator

Returns:

CSV-formatted string with provider-specific column names

Return type:

str

Raises:

DataProviderToolkitRuntimeError – When column name parsing fails

classmethod get_provider_tag_for_entity_column(*, data_block: type[BaseDataBlock], endpoint: Endpoint, endpoint_field_map: EndpointFieldMap, entity_column_name: str) → TagName

Return the provider tag for an endpoint’s entity field column.

Parameters:

data_block – Data block class defining the entity structure
endpoint – Endpoint whose field map should be consulted
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
entity_column_name – Column name in EntityName.field_name format

Returns:

Provider tag for the given column. For fields whose mapping is a PreprocessedFieldMapping, a plus-joined composite of its input tags is returned.

Return type:

TagName

Raises:

DataProviderToolkitRuntimeError – When entity_column_name is not in EntityName.field_name format
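
The lookup rule, including the plus-joined composite for preprocessed fields, might look roughly like the sketch below. FieldMappingSketch is a hypothetical stand-in that only copies the two documented attributes of PreprocessedFieldMapping, and the field map here is a flat dictionary keyed by column name for a single endpoint, which is a simplification of EndpointFieldMap.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldMappingSketch:
    """Hypothetical stand-in carrying the documented tags and preprocessors."""

    tags: list[str]
    preprocessors: list[Callable]


def provider_tag_for_column(
    field_map: dict[str, str | FieldMappingSketch],
    entity_column_name: str,
) -> str:
    """Resolve an EntityName.field_name column to its provider tag."""
    parts = entity_column_name.split(".")
    if len(parts) != 2 or not all(parts):
        raise RuntimeError(
            f"{entity_column_name!r} is not in EntityName.field_name format"
        )
    mapping = field_map[entity_column_name]
    if isinstance(mapping, FieldMappingSketch):
        # Preprocessed fields report every input tag, joined with '+'.
        return "+".join(mapping.tags)
    return mapping


# Hypothetical field map for one endpoint.
field_map = {
    "MarketData.close": "adjClose",
    "MarketData.vwap": FieldMappingSketch(tags=["high", "low"], preprocessors=[]),
}
simple_tag = provider_tag_for_column(field_map, "MarketData.close")
composite_tag = provider_tag_for_column(field_map, "MarketData.vwap")
```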

classmethod process_endpoint_tables(*, data_block: type[BaseDataBlock], endpoint_field_map: EndpointFieldMap, endpoint_tables: EndpointTables) → ProcessedEndpointTables

Process raw endpoint tables through remapping and preprocessing.

Transforms provider-specific tag names to entity.field format and applies configured preprocessor functions to compute derived fields from raw data.

Parameters:

data_block – Data block class defining the entity structure
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
endpoint_tables – Dictionary mapping endpoints to raw data tables

Returns:

Dictionary mapping endpoints to processed tables with standardized column names and computed fields

Return type:

ProcessedEndpointTables

Raises:

DataProviderToolkitArgumentError – When data_block is not a BaseDataBlock subclass
DataProviderToolkitNoDataError – When all provided tables are empty
DataProviderToolkitRuntimeError – When preprocessor execution fails
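
The remapping and preprocessing described for this method can be sketched for a single endpoint as follows. Tables are modeled as dictionaries of column lists, FieldMappingSketch is a hypothetical stand-in for PreprocessedFieldMapping, and the chaining rule (the first preprocessor receives all input columns, each later one receives the previous result) is an assumption, since the calling convention is not documented here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldMappingSketch:
    """Hypothetical stand-in carrying the documented tags and preprocessors."""

    tags: list[str]
    preprocessors: list[Callable]


def process_table(
    raw_columns: dict[str, list],
    field_map: dict[str, str | FieldMappingSketch],
) -> dict[str, list]:
    """Rename provider tags to entity.field names and apply any preprocessors."""
    processed: dict[str, list] = {}
    for entity_column, mapping in field_map.items():
        if isinstance(mapping, FieldMappingSketch):
            inputs = [raw_columns[tag] for tag in mapping.tags]
            # Assumed chaining: first call gets all inputs, later calls get the result.
            result = mapping.preprocessors[0](*inputs)
            for preprocessor in mapping.preprocessors[1:]:
                result = preprocessor(result)
            processed[entity_column] = result
        else:
            processed[entity_column] = raw_columns[mapping]
    return processed


def millions_to_units(column: list) -> list:
    """Scale values from millions to units, leaving None untouched."""
    return [None if value is None else value * 1_000_000 for value in column]


raw = {"mktCap": [1.5, None], "date": ["2024-01-02", "2024-01-03"]}
field_map = {
    "Fundamental.date": "date",
    "Fundamental.market_cap": FieldMappingSketch(
        tags=["mktCap"], preprocessors=[millions_to_units]
    ),
}
processed = process_table(raw, field_map)
```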