Data Provider Toolkit
- class PreprocessedFieldMapping(tags: list[TagName], preprocessors: list[Callable])
Bases: object
- preprocessors: list[Callable]
- tags: list[TagName]
- class DataProviderFieldPreprocessors
Bases: object
- static cast_datetime_to_date(column: DataColumn) -> DataColumn
Cast datetime values to date type.
Converts a column containing datetime values to date32 type, discarding time information.
- Parameters:
column – Column containing datetime values
- Returns:
Column with values cast to date type
- Return type:
DataColumn
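The effect of the cast can be pictured with plain Python values. This is only an illustrative sketch: it uses lists of `datetime` objects rather than a `DataColumn`, the helper name `cast_datetimes_to_dates` is hypothetical, and passing nulls through unchanged is an assumption.

```python
from datetime import date, datetime


def cast_datetimes_to_dates(values):
    """Drop the time component of each datetime; None passes through (assumed)."""
    return [value.date() if value is not None else None for value in values]


sample = [datetime(2024, 3, 31, 16, 30), None]
dates = cast_datetimes_to_dates(sample)
```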
- static convert_millions_to_units(column: DataColumn) -> DataColumn
Convert financial values from millions to individual units.
Takes a column containing values expressed in millions and multiplies each value by 1,000,000 to convert to standard units.
- Parameters:
column – Column containing values in millions
- Returns:
Column with values converted to standard units
- Return type:
DataColumn
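As a rough illustration of the conversion, the sketch below multiplies plain `Decimal` values by 1,000,000. It does not use the toolkit's `DataColumn` type; the helper name `millions_to_units` and the null pass-through are assumptions made for the example.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def millions_to_units(values):
    """Multiply each value by one million; None passes through (assumed)."""
    return [value * MILLION if value is not None else None for value in values]


units = millions_to_units([Decimal("1.5"), None, Decimal("0.25")])
```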
- class DataProviderToolkit
Bases: object
- DISCREPANCY_TABLE_SEPARATOR_CHARACTER: ClassVar[str] = '-'
- DISCREPANCY_TABLE_SEPARATOR_MAX_WIDTH: ClassVar[int] = 80
- static consolidate_processed_endpoint_tables(*, processed_endpoint_tables: ProcessedEndpointTables, table_merge_fields: list[EntityField], predominant_order_descending: bool = False) -> ConsolidatedFieldsTable
Consolidate multiple endpoint tables into a single unified table.
Merges processed tables from different endpoints by their primary keys, preserving row order and coalescing values from different endpoints. Validates that common columns across endpoints have consistent values for shared rows.
- Parameters:
processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
table_merge_fields – List of entity fields to use as primary keys for merging
predominant_order_descending – Whether the predominant ordering is descending
- Returns:
Consolidated table containing all data from all endpoints
- Return type:
ConsolidatedFieldsTable
- Raises:
DataProviderMultiEndpointCommonDataDiscrepancyError – When common columns have inconsistent values across endpoints
DataProviderMultiEndpointDuplicateKeysError – When an endpoint table has multiple rows sharing the same primary key
DataProviderMultiEndpointNullColumnsError – When any endpoint table contains a column whose every value is null
DataProviderToolkitRuntimeError – When no tables contain required primary key columns
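The merge, coalesce, and validation steps described above can be sketched with plain dictionaries. This is a simplified illustration, not the toolkit's implementation: rows are dicts rather than PyArrow tables, a single key column stands in for `table_merge_fields`, rows keep first-seen order instead of honouring `predominant_order_descending`, and `ValueError` stands in for the toolkit's own exception classes. The function name `consolidate` is hypothetical.

```python
def consolidate(tables, key):
    """Merge row dicts from several endpoints by a shared key column."""
    merged = {}  # key value -> merged row, in first-seen order
    for endpoint, rows in tables.items():
        seen = set()
        for row in rows:
            key_value = row[key]
            if key_value in seen:
                raise ValueError(f"duplicate key {key_value!r} in {endpoint}")
            seen.add(key_value)
            target = merged.setdefault(key_value, {})
            for column, value in row.items():
                current = target.get(column)
                if current is None:
                    # Coalesce: take the first non-null value seen so far.
                    target[column] = value
                elif value is not None and value != current:
                    raise ValueError(
                        f"discrepancy in {column!r} for key {key_value!r}"
                    )
    return list(merged.values())


tables = {
    "prices": [
        {"date": "2024-01-02", "close": 10.0},
        {"date": "2024-01-03", "close": 11.0},
    ],
    "volumes": [
        {"date": "2024-01-02", "volume": 500},
        {"date": "2024-01-03", "volume": None},
    ],
}
consolidated = consolidate(tables, "date")
```

Two endpoints that report different non-null values for the same column and key cause the sketch to raise, mirroring the discrepancy check described above.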
- classmethod create_endpoint_tables_from_json_mapping(endpoint_json_strings: dict[Endpoint, str]) -> EndpointTables
Create endpoint tables from JSON string representations.
Parses JSON strings for each endpoint and converts them into PyArrow tables, handling both JSON arrays and newline-delimited JSON formats.
- Parameters:
endpoint_json_strings – Dictionary mapping endpoints to their JSON string data
- Returns:
Dictionary mapping endpoints to parsed PyArrow tables
- Return type:
EndpointTables
- Raises:
DataProviderToolkitRuntimeError – When JSON parsing fails for any endpoint
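One way to accept both a JSON array and newline-delimited JSON is to inspect the first non-whitespace character. The sketch below uses only the standard `json` module and returns lists of dicts instead of PyArrow tables; the helper names and the use of `RuntimeError` in place of `DataProviderToolkitRuntimeError` are assumptions, and the toolkit may detect the format differently.

```python
import json


def parse_json_rows(text):
    """Parse either a JSON array or newline-delimited JSON into a list of rows."""
    stripped = text.strip()
    if not stripped:
        return []
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def parse_endpoint_json(endpoint_json_strings):
    """Parse each endpoint's JSON string, reporting which endpoint failed."""
    tables = {}
    for endpoint, text in endpoint_json_strings.items():
        try:
            tables[endpoint] = parse_json_rows(text)
        except json.JSONDecodeError as error:
            raise RuntimeError(f"could not parse JSON for {endpoint}") from error
    return tables


parsed = parse_endpoint_json({
    "array_endpoint": '[{"a": 1}, {"a": 2}]',
    "ndjson_endpoint": '{"a": 1}\n{"a": 2}\n',
})
```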
- classmethod drop_discrepant_processed_endpoint_tables_rows(*, discrepancy_table: EndpointDiscrepanciesTable, processed_endpoint_tables: ProcessedEndpointTables, key_column_names: list[str]) -> EndpointTables
Drop discrepant rows from processed endpoint tables.
Removes rows in each endpoint table whose primary keys match the discrepancy table, returning trimmed copies. Used when the discrepant rows cannot be reconciled and the surviving rows should be retained.
- Parameters:
discrepancy_table – Table containing primary keys of discrepant rows
processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
key_column_names – List of primary key column names
- Returns:
Dictionary mapping endpoints to tables with discrepant rows dropped
- Return type:
EndpointTables
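The row-dropping step amounts to an anti-join on the key columns. A minimal sketch with lists of dicts standing in for the PyArrow tables follows; the function name `drop_discrepant_rows` is hypothetical, and the real method may handle missing key columns differently.

```python
def drop_discrepant_rows(tables, discrepant_rows, key_column_names):
    """Return trimmed copies of each table without rows whose keys are discrepant."""
    discrepant_keys = {
        tuple(row[name] for name in key_column_names) for row in discrepant_rows
    }
    return {
        endpoint: [
            row
            for row in rows
            if tuple(row[name] for name in key_column_names) not in discrepant_keys
        ]
        for endpoint, rows in tables.items()
    }


endpoint_rows = {
    "prices": [{"date": "2024-01-02", "close": 10.0}, {"date": "2024-01-03", "close": 11.0}],
    "splits": [{"date": "2024-01-03", "ratio": 2.0}],
}
trimmed = drop_discrepant_rows(endpoint_rows, [{"date": "2024-01-03"}], ["date"])
```

Because new lists are built, the input tables are left untouched, which matches the "trimmed copies" wording above.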
- static find_common_table_missing_rows_mask(common_rows_table: Table, subset_rows_table: Table) -> BooleanArray | None
Identify rows in the common table that are missing from the subset table.
Performs a null-safe comparison between two tables by column position to determine which rows in the common table are not present in the subset table.
- Parameters:
common_rows_table – Table containing all potential rows
subset_rows_table – Table containing a subset of rows to check against
- Returns:
Boolean mask where True indicates a missing row, or None if the common table is empty
- Return type:
pyarrow.BooleanArray or None
- Raises:
DataProviderToolkitArgumentError – When the tables have different numbers of columns
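"Null-safe" here means that a null in one table matches a null in the same position of the other, which ordinary columnar or SQL equality would not do. In plain Python, `None == None` is already true, so tuples compared by position give the same behaviour. The sketch below is an illustration over lists of tuples, not the toolkit's PyArrow-based implementation; `missing_rows_mask` is a hypothetical name and `ValueError` stands in for `DataProviderToolkitArgumentError`.

```python
def missing_rows_mask(common_rows, subset_rows):
    """Return True for each common row absent from subset_rows, or None if empty."""
    if not common_rows:
        return None
    if subset_rows and len(common_rows[0]) != len(subset_rows[0]):
        raise ValueError("tables have different numbers of columns")
    # Tuples compare by position, and None equals None, so nulls match nulls.
    present = set(subset_rows)
    return [row not in present for row in common_rows]


mask = missing_rows_mask(
    [("a", 1), ("b", None), ("c", 3)],
    [("b", None)],
)
```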
- classmethod format_consolidated_discrepancy_table_for_output(*, discrepancy_table: Table, output_column_renames: list[str] | dict[str, str], csv_separator: str = '|') -> str
Format a discrepancy table as CSV string for output.
Converts a PyArrow table to CSV format with renamed columns and the specified separator, preserving datetime object formatting. The CSV body is wrapped between two separator lines built from DISCREPANCY_TABLE_SEPARATOR_CHARACTER, whose width matches the header line, capped at DISCREPANCY_TABLE_SEPARATOR_MAX_WIDTH characters.
- Parameters:
discrepancy_table – Table containing discrepancy data to format
output_column_renames – New column names as positional list or mapping dictionary
csv_separator – Character to use as CSV field separator
- Returns:
CSV-formatted string representation of the table, wrapped between separator lines
- Return type:
str
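The separator-line rule can be sketched with the standard `csv` module. This is an approximation for illustration only: it takes a header list and row lists rather than a PyArrow table, skips the column renaming and datetime handling, and the names `format_rows`, `SEPARATOR_CHARACTER`, and `SEPARATOR_MAX_WIDTH` are stand-ins for the class constants documented above.

```python
import csv
import io

SEPARATOR_CHARACTER = "-"  # stand-in for DISCREPANCY_TABLE_SEPARATOR_CHARACTER
SEPARATOR_MAX_WIDTH = 80   # stand-in for DISCREPANCY_TABLE_SEPARATOR_MAX_WIDTH


def format_rows(header, rows, csv_separator="|"):
    """Render rows as CSV wrapped between two separator lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=csv_separator, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    body = buffer.getvalue().rstrip("\n")
    header_line = body.split("\n", 1)[0]
    # Separator width follows the header line, capped at the maximum width.
    width = min(len(header_line), SEPARATOR_MAX_WIDTH)
    separator = SEPARATOR_CHARACTER * width
    return "\n".join([separator, body, separator])


formatted = format_rows(["date", "close"], [["2024-01-02", 10.0]])
```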
- classmethod format_endpoint_discrepancy_table_for_output(*, data_block: type[BaseDataBlock], discrepancy_table: EndpointDiscrepanciesTable, endpoints_enum: StrEnum, endpoint_field_map: EndpointFieldMap, csv_separator: str = '|') -> str
Format an endpoint discrepancy table with provider-specific naming.
Converts internal column naming (entity.field format) to provider endpoint tag format (endpoint.tag) and outputs as CSV string.
- Parameters:
data_block – Data block class defining the entity structure
discrepancy_table – Table containing endpoint discrepancy data
endpoints_enum – Enum defining available endpoints
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
csv_separator – Character to use as CSV field separator
- Returns:
CSV-formatted string with provider-specific column names
- Return type:
str
- Raises:
DataProviderToolkitRuntimeError – When column name parsing fails
- classmethod get_provider_tag_for_entity_column(*, data_block: type[BaseDataBlock], endpoint: Endpoint, endpoint_field_map: EndpointFieldMap, entity_column_name: str) -> TagName
Return the provider tag for an endpoint’s entity field column.
- Parameters:
data_block – Data block class defining the entity structure
endpoint – Endpoint whose field map should be consulted
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
entity_column_name – Column name in EntityName.field_name format
- Returns:
Provider tag for the given column. For fields whose mapping is a PreprocessedFieldMapping, a plus-joined composite of its input tags is returned.
- Return type:
TagName
- Raises:
DataProviderToolkitRuntimeError – When entity_column_name is not in EntityName.field_name format
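The lookup rules above can be sketched as follows. The data shapes are assumptions chosen to keep the example self-contained: the field map is a plain dict keyed by the `EntityName.field_name` string, a list of tags stands in for a `PreprocessedFieldMapping`, and `RuntimeError` stands in for `DataProviderToolkitRuntimeError`. The function name `provider_tag_for_column` is hypothetical.

```python
def provider_tag_for_column(field_map, entity_column_name):
    """Resolve an EntityName.field_name column to its provider tag."""
    parts = entity_column_name.split(".")
    if len(parts) != 2 or not all(parts):
        raise RuntimeError(
            f"expected EntityName.field_name, got {entity_column_name!r}"
        )
    mapping = field_map[entity_column_name]
    if isinstance(mapping, str):
        return mapping
    # A list of input tags stands in for PreprocessedFieldMapping.tags.
    return "+".join(mapping)


example_field_map = {
    "MarketData.close": "adjClose",
    "Fundamental.revenue": ["revenue", "currency"],
}
simple_tag = provider_tag_for_column(example_field_map, "MarketData.close")
composite_tag = provider_tag_for_column(example_field_map, "Fundamental.revenue")
```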
- classmethod process_endpoint_tables(*, data_block: type[BaseDataBlock], endpoint_field_map: EndpointFieldMap, endpoint_tables: EndpointTables) -> ProcessedEndpointTables
Process raw endpoint tables through remapping and preprocessing.
Transforms provider-specific tag names to entity.field format and applies configured preprocessor functions to compute derived fields from raw data.
- Parameters:
data_block – Data block class defining the entity structure
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
endpoint_tables – Dictionary mapping endpoints to raw data tables
- Returns:
Dictionary mapping endpoints to processed tables with standardized column names and computed fields
- Return type:
ProcessedEndpointTables
- Raises:
DataProviderToolkitArgumentError – When data_block is not a BaseDataBlock subclass
DataProviderToolkitNoDataError – When all provided tables are empty
DataProviderToolkitRuntimeError – When preprocessor execution fails
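The remap-then-preprocess flow can be pictured with columns held as plain lists. This sketch makes several simplifying assumptions: a table is a dict of column name to list, a mapping is either a tag string or a `(tag, preprocessors)` pair with a single input tag (the real `PreprocessedFieldMapping` takes a list of tags), and preprocessors are chained in order. How the toolkit passes multiple input columns to a preprocessor is not shown here. All names in the block are hypothetical.

```python
def millions_to_units(column):
    """Example preprocessor: scale values expressed in millions."""
    return [value * 1_000_000 if value is not None else None for value in column]


def process_columns(raw_columns, field_map):
    """Rename provider tags to entity.field names and apply preprocessors."""
    processed = {}
    for entity_column, mapping in field_map.items():
        if isinstance(mapping, str):
            # Plain mapping: copy the provider column under its new name.
            processed[entity_column] = raw_columns[mapping]
            continue
        tag, preprocessors = mapping
        column = raw_columns[tag]
        for preprocessor in preprocessors:
            column = preprocessor(column)
        processed[entity_column] = column
    return processed


raw = {"fiscalDate": ["2024-03-31"], "rev_mm": [2, None]}
processed_columns = process_columns(
    raw,
    {
        "Fundamental.period_end": "fiscalDate",
        "Fundamental.revenue": ("rev_mm", [millions_to_units]),
    },
)
```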