DataColumn

The DataColumn class is the foundational abstraction for all tabular operations in Data Curator. It wraps a pyarrow.Array and adds element-wise operations, comparison logic, and composability for calculated columns.

It powers the calculated column system, Boolean logic in filters, and type-safe arithmetic across datasets.

Overview

At its core, DataColumn:

Encapsulates a pyarrow.Array.
Enables arithmetic and comparison operations (+, -, ==, //, etc.).
Propagates nulls and broadcasts scalar operands consistently.
Supports composable transformations for use in custom calculations.

Basic Usage

from kaxanuk.data_curator.modules.data_column import DataColumn

col_a = DataColumn.load([1, 2, 3])
col_b = DataColumn.load([10, 20, 30])

result = col_a + col_b   # Element-wise addition
filtered = col_a > 1     # Element-wise comparison returns boolean DataColumn

result.to_pandas()       # Export to pandas
result.to_pyarrow()      # Export to pyarrow

Arithmetic Operators

You can apply arithmetic operations directly using standard Python syntax:

+ (via __add__)
- (via __sub__)
* (via __mul__)
/ (via __truediv__)
// (via __floordiv__)
% (via __mod__)

Reflected versions like 3 + col also work thanks to:

__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rmod__

All operations return a new DataColumn, with null-aware and type-safe behavior:

col = DataColumn.load([2, 4, 6])

col + 1         # [3, 5, 7]
col * 2         # [4, 8, 12]
10 - col        # [8, 6, 4]
col / 2         # [1.0, 2.0, 3.0]
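The operand order matters for reflected operators: 10 - col must subtract the column from 10, not the other way round. The following is a minimal sketch of that dispatch using a hypothetical ToyColumn class backed by a plain Python list. It is not the library's implementation, and it assumes that a null operand always yields a null result.

```python
from __future__ import annotations

import operator
from typing import Callable, List, Optional, Union

Number = Union[int, float]


class ToyColumn:
    """Hypothetical stand-in for DataColumn, backed by a plain list."""

    def __init__(self, values: List[Optional[Number]]) -> None:
        self.values = list(values)

    def _apply(
        self,
        other: Union["ToyColumn", Number, None],
        op: Callable[[Number, Number], Number],
        reflected: bool = False,
    ) -> "ToyColumn":
        # Broadcast a scalar operand to the column length.
        if isinstance(other, ToyColumn):
            others = other.values
        else:
            others = [other] * len(self.values)
        out: List[Optional[Number]] = []
        for mine, theirs in zip(self.values, others):
            if mine is None or theirs is None:
                out.append(None)  # assumption: a null operand gives a null result
            elif reflected:
                out.append(op(theirs, mine))  # the other operand is on the left
            else:
                out.append(op(mine, theirs))
        return ToyColumn(out)

    def __add__(self, other):
        return self._apply(other, operator.add)

    def __radd__(self, other):
        return self._apply(other, operator.add, reflected=True)

    def __sub__(self, other):
        return self._apply(other, operator.sub)

    def __rsub__(self, other):
        return self._apply(other, operator.sub, reflected=True)


col = ToyColumn([2, None, 6])
print((col + 1).values)   # [3, None, 7]
print((10 - col).values)  # [8, None, 4]
print((col - 10).values)  # [-8, None, -4]
```

Python only falls back to __rsub__ when the left operand (here the int 10) does not know how to handle the right one, which is why each arithmetic operator needs its reflected counterpart.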

Comparison Operators

DataColumn supports element-wise comparison using standard Python syntax:

== (via __eq__)
!= (via __ne__)
< (via __lt__)
<= (via __le__)
> (via __gt__)
>= (via __ge__)

Each of these returns a new DataColumn of boolean values:

col = DataColumn.load([5, 10, 15])

col > 7       # [False, True, True]
col == 10     # [False, True, False]
col <= col    # [True, True, True]
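To illustrate how an element-wise comparison can act as a filter mask, here is a small pure-Python sketch. The compare helper and the sample data are hypothetical, and two behaviours are assumed rather than taken from the library: null rows stay null in the mask, and a filter keeps only rows whose mask value is exactly True.

```python
import operator
from typing import Callable, List, Optional


def compare(
    values: List[Optional[int]],
    scalar: int,
    op: Callable[[int, int], bool],
) -> List[Optional[bool]]:
    """Compare each value with a scalar; null rows stay null (assumed)."""
    return [None if value is None else op(value, scalar) for value in values]


prices = [5, 10, None, 15]
mask = compare(prices, 7, operator.gt)
print(mask)  # [False, True, None, True]

# Assumed filter rule: keep a row only when its mask value is exactly True.
kept = [value for value, keep in zip(prices, mask) if keep is True]
print(kept)  # [10, 15]
```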

Logical Functions

Combine multiple boolean columns using:

DataColumn.boolean_and(…)
DataColumn.boolean_or(…)

These accept any mix of DataColumn objects, pyarrow scalars, and Python booleans. By default, a null in any input makes that row's result null; pass allow_null_comparisons=True to use Kleene logic instead.

DataColumn.boolean_and(col1 > 0, col2 < 100)
DataColumn.boolean_or(col1.is_null(), col2 == 5)
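The difference between the default null handling and Kleene logic is easiest to see row by row. The sketch below models both with plain Python values, using None for null. The function names are hypothetical and only illustrate the truth tables; they are not the library's code.

```python
from typing import List, Optional

Tri = Optional[bool]  # True, False, or None (null)


def and_strict(a: Tri, b: Tri) -> Tri:
    """Default-style AND: any null input makes the result null."""
    if a is None or b is None:
        return None
    return a and b


def and_kleene(a: Tri, b: Tri) -> Tri:
    """Kleene AND: a definite False wins even against a null."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or_kleene(a: Tri, b: Tri) -> Tri:
    """Kleene OR: a definite True wins even against a null."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


left: List[Tri] = [True, False, None, None]
right: List[Tri] = [None, None, True, False]

strict_and = [and_strict(a, b) for a, b in zip(left, right)]
kleene_and = [and_kleene(a, b) for a, b in zip(left, right)]
kleene_or = [or_kleene(a, b) for a, b in zip(left, right)]

print(strict_and)  # [None, None, None, None]
print(kleene_and)  # [None, False, None, False]
print(kleene_or)   # [True, None, True, None]
```

Kleene logic treats null as "unknown": the result is only null when the known operand cannot decide the outcome on its own.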

Equality Utilities

To compare full columns for equality:

equal(…): element-wise comparison, returns a boolean DataColumn
fully_equal(…): whole-column comparison, returns True, False, or None

DataColumn.equal(col1, col2, equal_nulls=True)
DataColumn.fully_equal(col1, col2, skip_nulls=True)
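As a rough model of how the two utilities relate, the sketch below implements both over plain Python lists, with None standing in for null. The exact interplay of equal_nulls and skip_nulls in the library is not spelled out here, so the rules in the comments are assumptions made for illustration.

```python
from typing import List, Optional


def equal(
    a: List[Optional[int]],
    b: List[Optional[int]],
    equal_nulls: bool = False,
) -> List[Optional[bool]]:
    """Element-wise equality over two equally long lists."""
    out: List[Optional[bool]] = []
    for x, y in zip(a, b):
        if x is None and y is None:
            # Assumption: two nulls match only when equal_nulls is set.
            out.append(True if equal_nulls else None)
        elif x is None or y is None:
            # Assumption: null vs. value is a mismatch under equal_nulls.
            out.append(False if equal_nulls else None)
        else:
            out.append(x == y)
    return out


def fully_equal(
    a: List[Optional[int]],
    b: List[Optional[int]],
    equal_nulls: bool = False,
    skip_nulls: bool = False,
) -> Optional[bool]:
    """Whole-column equality: True, False, or None when undecidable."""
    rows = equal(a, b, equal_nulls=equal_nulls)
    if skip_nulls:
        rows = [row for row in rows if row is not None]  # ignore null rows
    if any(row is None for row in rows):
        return None
    return all(rows)


a = [1, 2, None]
b = [1, 2, None]
print(equal(a, b))                          # [True, True, None]
print(equal(a, b, equal_nulls=True))        # [True, True, True]
print(fully_equal(a, b))                    # None
print(fully_equal(a, b, equal_nulls=True))  # True
print(fully_equal(a, b, skip_nulls=True))   # True
print(fully_equal([1, 2], [1, 3]))          # False
```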

String Concatenation

Use concatenate to merge string-type columns or scalars:

DataColumn.concatenate(col1, col2, separator="-", null_replacement="N/A")

This returns a new DataColumn with joined string values.
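The row-wise joining can be pictured with the pure-Python sketch below. The concatenate function here is a hypothetical stand-in that operates on lists and strings. It assumes that nulls are replaced per element before joining, and it returns a single string when every input is a scalar, mirroring the documented DataColumn | Scalar return type.

```python
from typing import List, Optional, Union

Column = List[Optional[str]]


def concatenate(
    *columns: Union[Column, str],
    separator: str = "",
    null_replacement: str = "",
) -> Union[List[str], str]:
    """Join list columns and broadcast string scalars row by row."""
    lengths = {len(col) for col in columns if isinstance(col, list)}
    if len(lengths) > 1:
        raise ValueError("all list columns must have the same length")
    if not lengths:
        # Only scalars were passed: return one joined string.
        return separator.join(columns)
    rows: List[str] = []
    for index in range(lengths.pop()):
        parts = []
        for col in columns:
            value = col[index] if isinstance(col, list) else col
            # Assumption: each null element is replaced before joining.
            parts.append(null_replacement if value is None else value)
        rows.append(separator.join(parts))
    return rows


tickers = ["AAPL", None, "MSFT"]
joined = concatenate(tickers, "2024", separator="-", null_replacement="N/A")
print(joined)  # ['AAPL-2024', 'N/A-2024', 'MSFT-2024']
print(concatenate("a", "b", separator="-"))  # a-b
```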

Loading and Conversion

You can create and convert DataColumn objects easily:

DataColumn.load(…): create from a list, pandas.Series, or pyarrow.Array
.to_pandas(): convert to pandas.Series
.to_pyarrow(): convert to pyarrow.Array
.type: get the native pyarrow type
.is_null(): check whether the underlying array is a pyarrow.NullArray

Advanced Behavior

Null-safe math and broadcasting are internally managed through helper methods:

_mask_dual_array_nulls(…)
_replace_array_mask_with_nones(…)
_return_null_column_on_null_operand(…)

These ensure safe and predictable behavior in pipelines, especially in user-defined calculations.

API Reference

class DataColumn(array: Array, /)

Bases: object

MAX_FLOAT_EPSILON_UNITS_DISCREPANCY = 128

classmethod boolean_and(*columns: DataColumn | Scalar | bool, allow_null_comparisons: bool = False) → DataColumn

Perform a logical AND comparison on multiple DataColumns.

Parameters:

*columns – The columns to be combined with boolean AND logic.
allow_null_comparisons – Whether to allow null comparisons with Kleene logic. Default is False, which outputs null on any row containing any null value.

Returns:

A new DataColumn containing the result of the logical AND comparison.

classmethod boolean_or(*columns: DataColumn | Scalar | bool, allow_null_comparisons: bool = False) → DataColumn

Perform a logical OR comparison on multiple DataColumns.

Parameters:

*columns – The columns to be combined with boolean OR logic.
allow_null_comparisons – Whether to allow null comparisons with Kleene logic. Default is False, which outputs null on any row containing any null value.

Returns:

A new DataColumn containing the result of the logical OR comparison.

classmethod concatenate(*columns: DataColumn | Scalar | str, null_replacement: str = '', separator: str = '') → DataColumn | Scalar

Concatenate DataColumns into one DataColumn.

Parameters:

*columns (DataColumn | pyarrow.Scalar | str) – The columns to be concatenated. Each can be a DataColumn, a pyarrow.Scalar, or a string.
null_replacement (str, optional) – The value to be used as replacement for null values in the concatenated result. Defaults to an empty string.
separator (str, optional) – The separator to be used between concatenated values. Defaults to an empty string.

Returns:

A new DataColumn containing the concatenated rows of the input columns, or a pyarrow.Scalar if all columns were strings or scalars.

Return type:

DataColumn | pyarrow.Scalar

classmethod equal(column1: DataColumn, column2: DataColumn, /, *, approximate_floats: bool = False, equal_nulls: bool = False) → DataColumn

Compare two DataColumns element-wise.

Parameters:

column1 – The first column to compare.
column2 – The second column to compare.
equal_nulls – Specifies whether null values should be considered equal. Default is False.
approximate_floats – Specifies whether floating-point value equality should compensate for rounding errors. Default is False.

Returns:

A DataColumn containing a pyarrow.BooleanArray indicating element-wise equality between the two columns.
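To show why approximate_floats can matter, the sketch below compares floats within a tolerance expressed in machine-epsilon units. How the library actually applies MAX_FLOAT_EPSILON_UNITS_DISCREPANCY is not documented here; scaling the tolerance by the larger operand's magnitude is an assumption, and approx_equal is a hypothetical helper.

```python
import sys

# Mirrors the value of the documented class constant; how the library
# uses it internally is an assumption in this sketch.
MAX_EPSILON_UNITS = 128


def approx_equal(a: float, b: float, units: int = MAX_EPSILON_UNITS) -> bool:
    """Return True when a and b differ by at most `units` relative epsilons."""
    tolerance = units * sys.float_info.epsilon * max(abs(a), abs(b))
    return abs(a - b) <= tolerance


print(0.1 + 0.2 == 0.3)              # False
print(approx_equal(0.1 + 0.2, 0.3))  # True
print(approx_equal(1.0, 1.0001))     # False
```

A relative tolerance keeps the check meaningful across magnitudes, whereas a fixed absolute tolerance would be too loose for tiny values and too strict for large ones.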

classmethod fully_equal(column1: DataColumn, column2: DataColumn, /, *, approximate_floats: bool = False, equal_nulls: bool = False, skip_nulls: bool = False) → bool | None

Check if two DataColumns are fully equal.

Parameters:

column1 – The first column to compare.
column2 – The second column to compare.
approximate_floats (bool, optional) – Whether floating-point comparison should use a tolerance to compensate for rounding errors. Defaults to False.
equal_nulls (bool, optional) – Whether null values are considered equal. Defaults to False.
skip_nulls (bool, optional) – Whether null values are ignored during the comparison. Defaults to False.

Returns:

None if equal_nulls is False and the columns contain nulls, True if both columns are equal, False otherwise.

Return type:

bool | None

is_null() → bool

Check if the underlying pyarrow.Array is a NullArray.

Returns:

Whether the underlying pyarrow.Array is a NullArray.

Return type:

bool

classmethod load(data: Iterable | DataColumn, dtype: DataType = None) → DataColumn

Wrap data (pyarrow.Array, pandas.Series, Iterable) in a new DataColumn object.

Parameters:

data – the data to be wrapped
dtype – the type of the underlying pyarrow.Array

Return type:

DataColumn

to_pandas() → Series

Convert to a pandas.Series that uses PyArrow as its backend, by means of ArrowExtensionArray.

Cf. https://pandas.pydata.org/docs/user_guide/pyarrow.html

Return type:

pandas.Series

to_pyarrow() → Array

Return the underlying native pyarrow.Array object.

Return type:

pyarrow.Array

property type: DataType

Return the type of the underlying native pyarrow.Array.

Return type:

pyarrow.DataType