Data Provider Toolkit

class PreprocessedFieldMapping(tags: list[TagName], preprocessors: list[Callable])

Bases: object

preprocessors: list[Callable]

tags: list[TagName]

class DataProviderFieldPreprocessors

Bases: object

static cast_datetime_to_date(column: DataColumn) → DataColumn

Cast datetime values to date type.

Converts a column containing datetime values to date32 type, discarding time information.

Parameters: column – Column containing datetime values
Returns: Column with values cast to date type
Return type: DataColumn
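The effect of the cast can be illustrated without the toolkit. The following is a minimal plain-Python sketch, not the library's implementation: the helper name `cast_datetimes_to_dates` is hypothetical, a list stands in for `DataColumn`, and passing nulls through unchanged is an assumption.

```python
from datetime import date, datetime


def cast_datetimes_to_dates(values):
    """Drop the time component of each datetime; None entries pass through (assumed)."""
    return [v.date() if v is not None else None for v in values]


raw = [datetime(2024, 3, 31, 16, 30), None, datetime(2024, 6, 30, 0, 0)]
dates = cast_datetimes_to_dates(raw)
```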

static convert_millions_to_units(column: DataColumn) → DataColumn

Convert financial values from millions to individual units.

Takes a column containing values expressed in millions and multiplies each value by 1,000,000 to convert to standard units.

Parameters: column – Column containing values in millions
Returns: Column with values converted to standard units
Return type: DataColumn
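As a rough illustration of the arithmetic only, the sketch below multiplies each value by 1,000,000 in plain Python. The helper name `millions_to_units` is hypothetical, a list of `Decimal` values stands in for `DataColumn`, and leaving nulls untouched is an assumption.

```python
from decimal import Decimal


def millions_to_units(values):
    """Multiply each value by 1,000,000; None entries pass through (assumed)."""
    return [v * 1_000_000 if v is not None else None for v in values]


revenue_in_millions = [Decimal("1.5"), None, Decimal("20")]
revenue_in_units = millions_to_units(revenue_in_millions)
```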

class DataProviderToolkit

Bases: object

classmethod clear_discrepant_processed_endpoint_tables_rows(*, discrepancy_table: EndpointDiscrepanciesTable, processed_endpoint_tables: ProcessedEndpointTables, key_column_names: list[str], preserved_column_names: list[str]) → EndpointTables

Clear discrepant rows from processed endpoint tables.

Identifies rows in processed endpoint tables that match primary keys in the discrepancy table and sets non-preserved column values to null for those rows across all endpoints.

Parameters:

discrepancy_table – Table containing primary keys of discrepant rows
processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
key_column_names – List of primary key column names
preserved_column_names – List of names of columns to preserve (not set to null)

Returns:

Dictionary mapping endpoints to tables with discrepant rows cleared

Return type:

EndpointTables
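The clearing rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the toolkit's code: tables are modelled as lists of row dictionaries, the discrepancy table is reduced to a set of key tuples, and the function name `clear_discrepant_rows` is hypothetical.

```python
def clear_discrepant_rows(tables, discrepant_keys, key_column_names, preserved_column_names):
    """Set non-preserved columns to None for rows whose key is in discrepant_keys."""
    preserved = set(preserved_column_names)
    cleared = {}
    for endpoint, rows in tables.items():
        new_rows = []
        for row in rows:
            key = tuple(row.get(name) for name in key_column_names)
            if key in discrepant_keys:
                # Keep preserved columns, null out everything else.
                row = {n: (v if n in preserved else None) for n, v in row.items()}
            else:
                row = dict(row)
            new_rows.append(row)
        cleared[endpoint] = new_rows
    return cleared


tables = {
    "income": [
        {"period_end": "2023-12-31", "revenue": 10},
        {"period_end": "2024-12-31", "revenue": 12},
    ],
    "balance": [{"period_end": "2023-12-31", "assets": 40}],
}
cleared = clear_discrepant_rows(
    tables,
    discrepant_keys={("2023-12-31",)},
    key_column_names=["period_end"],
    preserved_column_names=["period_end"],
)
```

In this sketch the key column is listed among the preserved columns so that cleared rows remain identifiable; whether the toolkit requires that is not stated here.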

static consolidate_processed_endpoint_tables(*, processed_endpoint_tables: ProcessedEndpointTables, table_merge_fields: list[EntityField], predominant_order_descending: bool = False) → ConsolidatedFieldsTable

Consolidate multiple endpoint tables into a single unified table.

Merges processed tables from different endpoints by their primary keys, preserving row order and coalescing values from different endpoints. Validates that common columns across endpoints have consistent values for shared rows.

Parameters:

processed_endpoint_tables – Dictionary mapping endpoints to their processed tables
table_merge_fields – List of entity fields to use as primary keys for merging
predominant_order_descending – Whether the predominant ordering is descending

Returns:

Consolidated table containing all data from all endpoints

Return type:

ConsolidatedFieldsTable

Raises:

DataProviderMultiEndpointCommonDataDiscrepancyError – When common columns have inconsistent values across endpoints
DataProviderToolkitRuntimeError – When no tables contain required primary key columns
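The merge-and-coalesce idea can be sketched as follows. This is a simplified plain-Python illustration, not the toolkit's algorithm: rows are dictionaries, `CommonDataDiscrepancyError` is a hypothetical stand-in for the documented exception, and sorting by key is an assumption that replaces the library's row-order handling.

```python
class CommonDataDiscrepancyError(Exception):
    """Hypothetical stand-in for the toolkit's discrepancy error."""


def consolidate(tables, key_columns, descending=False):
    """Merge rows from all endpoints by key, coalescing non-null values."""
    merged = {}
    for rows in tables.values():
        for row in rows:
            key = tuple(row[name] for name in key_columns)
            target = merged.setdefault(key, {})
            for name, value in row.items():
                existing = target.get(name)
                # A shared column must agree wherever both endpoints have a value.
                if existing is not None and value is not None and existing != value:
                    raise CommonDataDiscrepancyError(f"{name} differs for key {key}")
                if existing is None:
                    target[name] = value
    return [merged[key] for key in sorted(merged, reverse=descending)]


tables = {
    "income": [
        {"year": 2023, "ticker": "ABC", "revenue": 10},
        {"year": 2024, "ticker": "ABC", "revenue": 12},
    ],
    "balance": [{"year": 2024, "ticker": "ABC", "assets": 50}],
}
consolidated = consolidate(tables, ["year"])
```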

classmethod create_endpoint_tables_from_json_mapping(endpoint_json_strings: dict[Endpoint, str]) → EndpointTables

Create endpoint tables from JSON string representations.

Parses JSON strings for each endpoint and converts them into PyArrow tables, handling both JSON arrays and newline-delimited JSON formats.

Parameters: endpoint_json_strings – Dictionary mapping endpoints to their JSON string data
Returns: Dictionary mapping endpoints to parsed PyArrow tables
Return type: EndpointTables
Raises: DataProviderToolkitRuntimeError – When JSON parsing fails for any endpoint
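To show what the two accepted input shapes look like, here is a minimal sketch using only the standard `json` module. It produces lists of dictionaries rather than PyArrow tables, uses plain strings in place of `Endpoint` members, and the helper name `parse_json_records` is hypothetical; how the toolkit itself distinguishes the formats is not documented here.

```python
import json


def parse_json_records(text):
    """Parse a JSON array of objects, or newline-delimited JSON objects."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    # Otherwise treat every non-empty line as one JSON object.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


endpoint_json_strings = {
    "income": '[{"year": 2023, "revenue": 10}, {"year": 2024, "revenue": 12}]',
    "balance": '{"year": 2023, "assets": 40}\n{"year": 2024, "assets": 50}',
}
records = {
    endpoint: parse_json_records(text)
    for endpoint, text in endpoint_json_strings.items()
}
```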

static find_common_table_missing_rows_mask(common_rows_table: Table, subset_rows_table: Table) → BooleanArray | None

Identify rows in the common table that are missing from the subset table.

Performs a null-safe comparison between two tables by column position to determine which rows in the common table are not present in the subset table.

Parameters:

common_rows_table – Table containing all potential rows
subset_rows_table – Table containing a subset of rows to check against

Returns:

Boolean mask where True indicates a missing row, or None if the common table is empty

Return type:

pyarrow.BooleanArray or None

Raises:

DataProviderToolkitArgumentError – When the tables have different numbers of columns
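A null-safe, position-based comparison can be illustrated with tuples, where `None` compares equal to `None`. The sketch below is plain Python and makes several assumptions: rows are tuples rather than PyArrow columns, `ValueError` stands in for the documented argument error, and the function name `missing_rows_mask` is hypothetical.

```python
def missing_rows_mask(common_rows, subset_rows):
    """Return True for each common row that is absent from subset_rows."""
    if not common_rows:
        return None
    if subset_rows and len(common_rows[0]) != len(subset_rows[0]):
        raise ValueError("tables have different numbers of columns")
    # Tuples compare by position, and None == None, so the lookup is null-safe.
    present = set(subset_rows)
    return [row not in present for row in common_rows]


common = [(2023, "ABC"), (2024, None), (2025, "ABC")]
subset = [(2024, None)]
mask = missing_rows_mask(common, subset)
```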

static format_consolidated_discrepancy_table_for_output(*, discrepancy_table: Table, output_column_renames: list[str] | dict[str, str], csv_separator: str = '|') → str

Format a discrepancy table as a CSV string for output.

Converts a PyArrow table to CSV format with renamed columns and the specified separator, preserving datetime object formatting.

Parameters:

discrepancy_table – Table containing discrepancy data to format
output_column_renames – New column names, given either as a positional list or as a mapping from old to new names
csv_separator – Character to use as CSV field separator

Returns:

CSV-formatted string representation of the table

Return type:

str
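The shape of the output can be approximated with the standard `csv` module. This is a sketch under assumptions, not the toolkit's formatter: rows are tuples instead of a PyArrow table, the function name `format_rows_as_csv` is hypothetical, and leaving unmapped columns unchanged when a dictionary is given is assumed behaviour.

```python
import csv
import io


def format_rows_as_csv(column_names, rows, output_column_renames, csv_separator="|"):
    """Write rows as CSV text, renaming the header by position or by mapping."""
    if isinstance(output_column_renames, dict):
        header = [output_column_renames.get(name, name) for name in column_names]
    else:
        header = list(output_column_renames)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=csv_separator, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


columns = ["entity.period_end", "entity.revenue"]
rows = [("2023-12-31", 10), ("2024-12-31", 12)]
by_position = format_rows_as_csv(columns, rows, ["Period End", "Revenue"])
by_mapping = format_rows_as_csv(columns, rows, {"entity.revenue": "Revenue"}, csv_separator=";")
```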

classmethod format_endpoint_discrepancy_table_for_output(*, data_block: type[BaseDataBlock], discrepancy_table: EndpointDiscrepanciesTable, endpoints_enum: StrEnum, endpoint_field_map: EndpointFieldMap, csv_separator: str = '|') → str

Format an endpoint discrepancy table with provider-specific naming.

Converts internal column names (entity.field format) to the provider's endpoint tag format (endpoint.tag) and returns the result as a CSV string.

Parameters:

data_block – Data block class defining the entity structure
discrepancy_table – Table containing endpoint discrepancy data
endpoints_enum – Enum defining available endpoints
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
csv_separator – Character to use as CSV field separator

Returns:

CSV-formatted string with provider-specific column names

Return type:

str

Raises:

DataProviderToolkitRuntimeError – When column name parsing fails

classmethod process_endpoint_tables(*, data_block: type[BaseDataBlock], endpoint_field_map: EndpointFieldMap, endpoint_tables: EndpointTables) → ProcessedEndpointTables

Process raw endpoint tables through remapping and preprocessing.

Transforms provider-specific tag names to entity.field format and applies configured preprocessor functions to compute derived fields from raw data.

Parameters:

data_block – Data block class defining the entity structure
endpoint_field_map – Mapping from entity fields to provider tags per endpoint
endpoint_tables – Dictionary mapping endpoints to raw data tables

Returns:

Dictionary mapping endpoints to processed tables with standardized column names and computed fields

Return type:

ProcessedEndpointTables

Raises:

DataProviderToolkitArgumentError – When data_block is not a BaseDataBlock subclass
DataProviderToolkitNoDataError – When all provided tables are empty
DataProviderToolkitRuntimeError – When preprocessor execution fails
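The remap-then-preprocess flow can be sketched for a single endpoint table in plain Python. Everything here is illustrative: the real structure of `EndpointFieldMap` is not shown in this reference, so the sketch assumes a dictionary from an entity.field name to one provider tag plus a list of preprocessors. The tag names, `to_units`, and `process_table` are all hypothetical.

```python
def to_units(values):
    """Example preprocessor: convert values expressed in millions to units."""
    return [v * 1_000_000 if v is not None else None for v in values]


def process_table(raw_columns, field_map):
    """Rename provider tags to entity.field names and apply preprocessors in order."""
    processed = {}
    for field_name, (tag, preprocessors) in field_map.items():
        if tag not in raw_columns:
            continue  # this endpoint does not supply the tag
        values = raw_columns[tag]
        for preprocess in preprocessors:
            values = preprocess(values)
        processed[field_name] = values
    return processed


raw_columns = {
    "fiscalDateEnding": ["2023-12-31", "2024-12-31"],
    "totalRevenueMillions": [10, 12],
}
field_map = {
    "entity.period_end": ("fiscalDateEnding", []),
    "entity.revenue": ("totalRevenueMillions", [to_units]),
    "entity.assets": ("totalAssets", []),
}
processed = process_table(raw_columns, field_map)
```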
