Data Provider Toolkit
- class PreprocessedFieldMapping(tags: list[TagName], preprocessors: list[Callable])
Bases: object
- preprocessors: list[Callable]
- tags: list[TagName]
- class DataProviderFieldPreprocessors
Bases: object
- static cast_datetime_to_date(column: DataColumn) → DataColumn
Cast datetime values to date type.
Converts a column containing datetime values to date32 type, discarding time information.
- Parameters:
column – Column containing datetime values
- Returns:
Column with values cast to date type
- Return type:
DataColumn
- static convert_millions_to_units(column: DataColumn) → DataColumn
Convert financial values from millions to individual units.
Takes a column containing values expressed in millions and multiplies each value by 1,000,000 to convert to standard units.
- Parameters:
column – Column containing values in millions
- Returns:
Column with values converted to standard units
- Return type:
DataColumn
- class DataProviderToolkit
Bases: object
- classmethod clear_discrepant_processed_endpoint_tables_rows(*, discrepancy_table: EndpointDiscrepanciesTable, processed_endpoint_tables: ProcessedEndpointTables, key_column_names: list[str], preserved_column_names: list[str]) → EndpointTables
Clear discrepant rows from processed endpoint tables.
Identifies rows in processed endpoint tables that match primary keys in the discrepancy table and sets non-preserved column values to null for those rows across all endpoints.
- Parameters:
discrepancy_table – Table containing primary keys of discrepant rows
processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
key_column_names – List of primary key column names
preserved_column_names – List of names of columns to preserve (not set to null)
- Returns:
Dictionary mapping endpoints to tables with discrepant rows cleared
- Return type:
EndpointTables
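The clearing behaviour can be sketched in plain Python, under the assumption that each table is a list of row dictionaries and that the discrepancy table has been reduced to a set of key tuples. The function name `clear_discrepant_rows` and the endpoint and column names are hypothetical; this is not the toolkit's implementation.

```python
def clear_discrepant_rows(endpoint_rows, discrepant_keys, key_column_names, preserved_column_names):
    """Null out non-preserved values in every row whose key is discrepant."""
    preserved = set(preserved_column_names)
    cleared = {}
    for endpoint, rows in endpoint_rows.items():
        cleared[endpoint] = []
        for row in rows:
            key = tuple(row[name] for name in key_column_names)
            if key in discrepant_keys:
                # Keep preserved columns, set everything else to None.
                row = {name: (value if name in preserved else None) for name, value in row.items()}
            cleared[endpoint].append(row)
    return cleared


endpoint_rows = {
    "prices": [{"date": "2024-03-15", "close": 101.5}, {"date": "2024-03-18", "close": 102.0}],
    "volumes": [{"date": "2024-03-15", "volume": 9000}],
}
result = clear_discrepant_rows(
    endpoint_rows,
    discrepant_keys={("2024-03-15",)},
    key_column_names=["date"],
    preserved_column_names=["date"],
)
```

In this sketch the key column is listed among the preserved columns so that cleared rows can still be identified afterwards.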
- static consolidate_processed_endpoint_tables(*, processed_endpoint_tables: ProcessedEndpointTables, table_merge_fields: list[EntityField], predominant_order_descending: bool = False) → ConsolidatedFieldsTable
Consolidate multiple endpoint tables into a single unified table.
Merges processed tables from different endpoints by their primary keys, preserving row order and coalescing values from different endpoints. Validates that common columns across endpoints have consistent values for shared rows.
- Parameters:
processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
table_merge_fields – List of entity fields to use as primary keys for merging
predominant_order_descending – Whether the predominant ordering is descending
- Returns:
Consolidated table containing all data from all endpoints
- Return type:
ConsolidatedFieldsTable
- Raises:
DataProviderMultiEndpointCommonDataDiscrepancyError – When common columns have inconsistent values across endpoints
DataProviderToolkitRuntimeError – When no tables contain required primary key columns
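The merge-and-coalesce idea can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions: tables are lists of row dictionaries, rows keep their order of first appearance, and the `predominant_order_descending` handling is left out. `consolidate` and `CommonDataDiscrepancyError` are hypothetical stand-ins, not the toolkit's names.

```python
class CommonDataDiscrepancyError(Exception):
    """Hypothetical stand-in for the toolkit's discrepancy error."""


def consolidate(endpoint_rows: dict[str, list[dict]], key_columns: list[str]) -> list[dict]:
    """Merge rows from all endpoints by key, coalescing non-null values."""
    merged: dict[tuple, dict] = {}
    columns: list[str] = []
    for endpoint, rows in endpoint_rows.items():
        for row in rows:
            key = tuple(row[name] for name in key_columns)
            target = merged.setdefault(key, {})
            for name, value in row.items():
                if name not in columns:
                    columns.append(name)
                existing = target.get(name)
                if existing is None:
                    target[name] = value
                elif value is not None and existing != value:
                    # A column shared by two endpoints disagrees for this row.
                    raise CommonDataDiscrepancyError(f"{endpoint}: {name} differs for key {key}")
    # Fill columns an endpoint did not provide with None.
    return [{name: row.get(name) for name in columns} for row in merged.values()]


tables = {
    "prices": [{"date": "2024-03-15", "close": 101.5}, {"date": "2024-03-18", "close": 102.0}],
    "volumes": [{"date": "2024-03-15", "volume": 9000}],
}
consolidated = consolidate(tables, ["date"])
```

Raising on the first conflicting value mirrors the documented validation of common columns; collecting all conflicts before raising would be an equally plausible design.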
- classmethod create_endpoint_tables_from_json_mapping(endpoint_json_strings: dict[Endpoint, str]) → EndpointTables
Create endpoint tables from JSON string representations.
Parses JSON strings for each endpoint and converts them into PyArrow tables, handling both JSON arrays and newline-delimited JSON formats.
- Parameters:
endpoint_json_strings – Dictionary mapping endpoints to their JSON string data
- Returns:
Dictionary mapping endpoints to parsed PyArrow tables
- Return type:
EndpointTables
- Raises:
DataProviderToolkitRuntimeError – When JSON parsing fails for any endpoint
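The handling of the two JSON layouts can be sketched with the standard library alone. The sketch below returns lists of dictionaries rather than PyArrow tables, raises a plain `RuntimeError` in place of the toolkit's error class, and uses the hypothetical names `parse_json_rows` and `parse_endpoint_json`.

```python
import json


def parse_json_rows(text: str) -> list[dict]:
    """Parse either a JSON array or newline-delimited JSON into row dicts."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    # Newline-delimited JSON: one object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def parse_endpoint_json(endpoint_json_strings: dict[str, str]) -> dict[str, list[dict]]:
    """Parse every endpoint's JSON string, naming the endpoint on failure."""
    tables = {}
    for endpoint, text in endpoint_json_strings.items():
        try:
            tables[endpoint] = parse_json_rows(text)
        except json.JSONDecodeError as error:
            raise RuntimeError(f"could not parse JSON for endpoint {endpoint!r}") from error
    return tables


parsed = parse_endpoint_json({
    "array": '[{"date": "2024-03-15", "close": 101.5}, {"date": "2024-03-18", "close": 102.0}]',
    "ndjson": '{"date": "2024-03-15", "close": 101.5}\n{"date": "2024-03-18", "close": 102.0}\n',
})
```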
- static find_common_table_missing_rows_mask(common_rows_table: Table, subset_rows_table: Table) → BooleanArray | None
Identify rows in common table that are missing from subset table.
Performs a null-safe comparison between two tables by column position to determine which rows in the common table are not present in the subset table.
- Parameters:
common_rows_table – Table containing all potential rows
subset_rows_table – Table containing a subset of rows to check against
- Returns:
Boolean mask where True indicates missing rows, or None if common table is empty
- Return type:
pyarrow.BooleanArray or None
- Raises:
DataProviderToolkitArgumentError – When the tables have a different number of columns
- static format_consolidated_discrepancy_table_for_output(*, discrepancy_table: Table, output_column_renames: list[str] | dict[str, str], csv_separator: str = '|') → str
Format a discrepancy table as CSV string for output.
Converts a PyArrow table to CSV format with renamed columns and specified separator, preserving datetime object formatting.
- Parameters:
discrepancy_table – Table containing discrepancy data to format
output_column_renames – New column names as positional list or mapping dictionary
csv_separator – Character to use as CSV field separator
- Returns:
CSV-formatted string representation of the table
- Return type:
str
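The two accepted shapes of `output_column_renames` can be illustrated with the standard `csv` module. This sketch takes column names and row tuples instead of a PyArrow table, the function name `format_rows_as_csv` is hypothetical, and leaving unmapped columns unchanged in the dictionary case is an assumption.

```python
import csv
import io


def format_rows_as_csv(column_names, rows, renames, separator="|"):
    """Write rows as CSV text, renaming columns by position or by mapping."""
    if isinstance(renames, dict):
        # Mapping form: rename only the columns named in the mapping.
        header = [renames.get(name, name) for name in column_names]
    else:
        # Positional form: the list replaces the header as a whole.
        header = list(renames)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=separator, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [("2024-03-15", 101.5)]
by_position = format_rows_as_csv(["d", "c"], rows, ["date", "close"])
by_mapping = format_rows_as_csv(["d", "c"], rows, {"c": "close"}, separator=";")
```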
- classmethod format_endpoint_discrepancy_table_for_output(*, data_block: type[BaseDataBlock], discrepancy_table: EndpointDiscrepanciesTable, endpoints_enum: StrEnum, endpoint_field_map: EndpointFieldMap, csv_separator: str = '|') → str
Format an endpoint discrepancy table with provider-specific naming.
Converts internal column naming (entity.field format) to provider endpoint tag format (endpoint.tag) and outputs as CSV string.
- Parameters:
data_block – Data block class defining the entity structure
discrepancy_table – Table containing endpoint discrepancy data
endpoints_enum – Enum defining available endpoints
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
csv_separator – Character to use as CSV field separator
- Returns:
CSV-formatted string with provider-specific column names
- Return type:
str
- Raises:
DataProviderToolkitRuntimeError – When column name parsing fails
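The renaming step can be sketched as a lookup from internal `entity.field` names to provider tags, prefixed with the endpoint name. How the real discrepancy table encodes the endpoint for each column is not shown here, so this sketch assumes the endpoint is known per call; `to_provider_column_names` and the sample names are hypothetical.

```python
def to_provider_column_names(endpoint: str, columns: list[str], field_to_tag: dict[str, str]) -> list[str]:
    """Translate entity.field column names into endpoint.tag names."""
    renamed = []
    for column in columns:
        if column not in field_to_tag:
            # Stands in for the documented failure when a name cannot be parsed.
            raise RuntimeError(f"cannot map column {column!r} for endpoint {endpoint!r}")
        renamed.append(f"{endpoint}.{field_to_tag[column]}")
    return renamed


provider_names = to_provider_column_names(
    "prices",
    ["market.date", "market.close"],
    {"market.date": "date", "market.close": "adjClose"},
)
```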
- classmethod process_endpoint_tables(*, data_block: type[BaseDataBlock], endpoint_field_map: EndpointFieldMap, endpoint_tables: EndpointTables) → ProcessedEndpointTables
Process raw endpoint tables through remapping and preprocessing.
Transforms provider-specific tag names to entity.field format and applies configured preprocessor functions to compute derived fields from raw data.
- Parameters:
data_block – Data block class defining the entity structure
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
endpoint_tables – Dictionary mapping endpoints to raw data tables
- Returns:
Dictionary mapping endpoints to processed tables with standardized column names and computed fields
- Return type:
ProcessedEndpointTables
- Raises:
DataProviderToolkitArgumentError – When data_block is not a BaseDataBlock subclass
DataProviderToolkitNoDataError – When all provided tables are empty
DataProviderToolkitRuntimeError – When preprocessor execution fails