DataColumn

The DataColumn class is the foundational abstraction for all tabular operations in Data Curator. It wraps a pyarrow.Array and adds element-wise operations, comparison logic, and composability for calculated columns.

It powers the calculated column system, Boolean logic in filters, and type-safe arithmetic across datasets.

Overview

At its core, DataColumn:

  • Encapsulates a pyarrow.Array.

  • Enables arithmetic and comparison operations (+, -, ==, //, etc.).

  • Propagates nulls and broadcasts scalar operands consistently across operations.

  • Supports composable transformations for use in custom calculations.

Basic Usage

from kaxanuk.data_curator.modules.data_column import DataColumn

col_a = DataColumn.load([1, 2, 3])
col_b = DataColumn.load([10, 20, 30])

result = col_a + col_b   # Element-wise addition
filtered = col_a > 1     # Element-wise comparison returns boolean DataColumn

result.to_pandas()       # Export to pandas
result.to_pyarrow()      # Export to pyarrow

Arithmetic Operators

You can apply arithmetic operations directly using standard Python syntax:

  • + (via __add__)

  • - (via __sub__)

  • * (via __mul__)

  • / (via __truediv__)

  • // (via __floordiv__)

  • % (via __mod__)

Reflected versions like 3 + col also work thanks to:

  • __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rmod__

All operations return a new DataColumn, with null-aware and type-safe behavior:

col = DataColumn.load([2, 4, 6])

col + 1         # [3, 5, 7]
col * 2         # [4, 8, 12]
10 - col        # [8, 6, 4]
col / 2         # [1.0, 2.0, 3.0]
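
To make the operator dispatch concrete, here is a minimal pure-Python sketch of how a column wrapper could support both col + 1 and 10 - col while propagating nulls. MiniColumn is a hypothetical stand-in written for illustration only; it is not DataColumn's implementation, which operates on a pyarrow.Array.

```python
import operator


class MiniColumn:
    """Hypothetical stand-in for illustration; not the DataColumn implementation."""

    def __init__(self, values):
        self.values = list(values)

    def _apply(self, other, op, reflected=False):
        # Broadcast a scalar operand to the column's length.
        if isinstance(other, MiniColumn):
            others = other.values
        else:
            others = [other] * len(self.values)
        if len(others) != len(self.values):
            raise ValueError("columns must have the same length")
        result = []
        for left, right in zip(self.values, others):
            if left is None or right is None:
                result.append(None)  # a null operand yields a null result
            elif reflected:
                result.append(op(right, left))  # the scalar is the left-hand operand
            else:
                result.append(op(left, right))
        return MiniColumn(result)

    def __add__(self, other):
        return self._apply(other, operator.add)

    def __radd__(self, other):
        return self._apply(other, operator.add, reflected=True)

    def __sub__(self, other):
        return self._apply(other, operator.sub)

    def __rsub__(self, other):
        return self._apply(other, operator.sub, reflected=True)


col = MiniColumn([2, None, 6])
print((col + 1).values)   # [3, None, 7]
print((10 - col).values)  # [8, None, 4]
```

Python calls the reflected method (here __rsub__) only after int.__sub__ returns NotImplemented for the unknown right-hand type, which is why 10 - col reaches the column at all.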

Comparison Operators

DataColumn supports element-wise comparison using standard Python syntax:

  • == (via __eq__)

  • != (via __ne__)

  • < (via __lt__)

  • <= (via __le__)

  • > (via __gt__)

  • >= (via __ge__)

Each of these returns a new DataColumn of boolean values:

col = DataColumn.load([5, 10, 15])

col > 7       # [False, True, True]
col == 10     # [False, True, False]
col <= col    # [True, True, True]

Logical Functions

Combine multiple boolean columns using:

  • DataColumn.boolean_and(…)

  • DataColumn.boolean_or(…)

These accept any mix of DataColumn objects, pyarrow scalars, and Python booleans. By default, any row that contains a null yields null; pass allow_null_comparisons=True to apply Kleene logic instead.

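col1 = DataColumn.load([1, None, 3])
col2 = DataColumn.load([5, 50, 500])
DataColumn.boolean_and(col1 > 0, col2 < 100)
DataColumn.boolean_or(col1.is_null(), col2 == 5)   # is_null() returns one bool for the whole column, not a per-row mask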
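
The difference between the default null handling and Kleene logic is easiest to see on single values. The sketch below uses plain Python, with None standing in for null; strict_and and kleene_and are hypothetical helper names. pyarrow itself offers both flavors as pyarrow.compute.and_ and pyarrow.compute.and_kleene, but whether DataColumn delegates to them is not stated here.

```python
def strict_and(left, right):
    """AND that returns None (null) whenever either operand is null."""
    if left is None or right is None:
        return None
    return left and right


def kleene_and(left, right):
    """Kleene AND: a definite False wins even if the other operand is null."""
    if left is False or right is False:
        return False
    if left is None or right is None:
        return None
    return True


rows = [(True, None), (False, None), (True, True)]
print([strict_and(a, b) for a, b in rows])  # [None, None, True]
print([kleene_and(a, b) for a, b in rows])  # [None, False, True]
```

Only the second row differs: under Kleene logic, False AND null is False, because no value of the unknown operand could make the result True.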

Equality Utilities

To compare full columns for equality:

  • equal(…) — element-wise

  • fully_equal(…) — total match (returns True, False or None)

DataColumn.equal(col1, col2, equal_nulls=True)
DataColumn.fully_equal(col1, col2, skip_nulls=True)
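
Because fully_equal can return None as well as True or False, a plain truthiness check such as "if not result" treats "undetermined" and "different" alike. A small sketch of handling all three outcomes explicitly follows; describe_match is a hypothetical helper, not part of the library.

```python
def describe_match(result):
    """Translate a tri-state result (True, False, or None) into a label."""
    if result is None:
        # Check None first: it is falsy and would otherwise look like False.
        return "undetermined"
    return "equal" if result else "different"


for outcome in (True, False, None):
    print(outcome, "->", describe_match(outcome))
```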

String Concatenation

Use concatenate to merge string-type columns or scalars:

DataColumn.concatenate(col1, col2, separator="-", null_replacement="N/A")

This returns a new DataColumn with the joined string values, or a pyarrow.Scalar if every input was a string or scalar.
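
As a rough mental model of the row-wise behavior, the sketch below joins plain Python lists. concat_rows is a hypothetical helper, and it assumes that null_replacement is substituted for each null value before joining; the library's exact treatment of nulls may differ.

```python
def concat_rows(*columns, separator="", null_replacement=""):
    """Join values row by row; plain strings are broadcast to every row."""
    lengths = {len(c) for c in columns if isinstance(c, list)}
    if len(lengths) > 1:
        raise ValueError("all list columns must have the same length")
    if not lengths:
        # Only scalars were passed: return a single joined string.
        return separator.join(null_replacement if c is None else c for c in columns)
    size = lengths.pop()
    expanded = [c if isinstance(c, list) else [c] * size for c in columns]
    return [
        separator.join(null_replacement if v is None else v for v in row)
        for row in zip(*expanded)
    ]


print(concat_rows(["a", None], ["x", "y"], separator="-", null_replacement="N/A"))
# ['a-x', 'N/A-y']
print(concat_rows(["a", "b"], "!"))         # ['a!', 'b!']
print(concat_rows("a", "b", separator="-"))  # a-b
```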

Loading and Conversion

You can create and convert DataColumn objects easily:

  • DataColumn.load(…): create a DataColumn from a list, pandas.Series, or pyarrow.Array

  • .to_pandas(): convert to a pandas.Series

  • .to_pyarrow(): return the underlying pyarrow.Array

  • .type: get the native pyarrow type

  • .is_null(): check whether the underlying array is a pyarrow NullArray

Advanced Behavior

Null-safe math and broadcasting are internally managed through helper methods:

  • _mask_dual_array_nulls(…)

  • _replace_array_mask_with_nones(…)

  • _return_null_column_on_null_operand(…)

These ensure safe and predictable behavior in pipelines, especially in user-defined calculations.
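
The helper names suggest a mask-compute-restore pattern: find the positions where either operand is null, run the arithmetic on placeholder values, then put nulls back at those positions. The sketch below is a guess at that general pattern in plain Python, not the library's code; combined_null_mask and masked_divide are hypothetical names.

```python
def combined_null_mask(left, right):
    """Return True at every position where either input is null."""
    return [a is None or b is None for a, b in zip(left, right)]


def masked_divide(left, right):
    """Divide element-wise, yielding None wherever either operand is None."""
    mask = combined_null_mask(left, right)
    # Substitute harmless placeholders so the arithmetic never sees None.
    safe_left = [0 if m else a for a, m in zip(left, mask)]
    safe_right = [1 if m else b for b, m in zip(right, mask)]
    raw = [a / b for a, b in zip(safe_left, safe_right)]
    # Restore nulls at the masked positions.
    return [None if m else v for v, m in zip(raw, mask)]


print(masked_divide([10, None, 9], [2, 4, None]))  # [5.0, None, None]
```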

API Reference

class DataColumn(array: Array, /)

Bases: object

MAX_FLOAT_EPSILON_UNITS_DISCREPANCY = 128

classmethod boolean_and(*columns: DataColumn | Scalar | bool, allow_null_comparisons: bool = False) -> DataColumn

Perform a logical AND comparison on multiple DataColumns.

Parameters:
  • *columns – The columns to be combined with boolean AND logic.

  • allow_null_comparisons – Whether to allow null comparisons with Kleene logic. Default is False, which outputs null on any row containing any null value.

Returns:

A new DataColumn containing the result of the logical AND comparison.

Return type:

DataColumn

classmethod boolean_or(*columns: DataColumn | Scalar | bool, allow_null_comparisons: bool = False) -> DataColumn

Perform a logical OR comparison on multiple DataColumns.

Parameters:
  • *columns – The columns to be combined with boolean OR logic.

  • allow_null_comparisons – Whether to allow null comparisons with Kleene logic. Default is False, which outputs null on any row containing any null value.

Returns:

A new DataColumn containing the result of the logical OR comparison.

Return type:

DataColumn

classmethod concatenate(*columns: DataColumn | Scalar | str, null_replacement: str = '', separator: str = '') -> DataColumn | Scalar

Concatenate DataColumns into one DataColumn.

Parameters:
  • *columns (DataColumn | pyarrow.Scalar | str) – The columns to be concatenated. Each column can be a DataColumn object, a pyarrow.Scalar, or a string.

  • null_replacement (str, optional) – The value to be used as replacement for null values in the concatenated result. Defaults to an empty string.

  • separator (str, optional) – The separator to be used between concatenated values. Defaults to an empty string.

Returns:

A new DataColumn containing the concatenated rows of the input columns, or a pyarrow.Scalar if all columns were strings or scalars.

Return type:

DataColumn | pyarrow.Scalar

classmethod equal(column1: DataColumn, column2: DataColumn, /, *, approximate_floats: bool = False, equal_nulls: bool = False) -> DataColumn

Compare two DataColumns element-wise.

Parameters:
  • column1 – The first column to compare.

  • column2 – The second column to compare.

  • equal_nulls – Specifies whether null values should be considered equal. Default is False.

  • approximate_floats – Specifies whether floating-point value equality should compensate for rounding errors. Default is False.

Returns:

A DataColumn containing a pyarrow.BooleanArray indicating element-wise equality between the two columns.

Return type:

DataColumn
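
The approximate_floats option, together with the MAX_FLOAT_EPSILON_UNITS_DISCREPANCY constant, suggests a tolerance measured in multiples of machine epsilon. How the library applies it is not documented on this page. The sketch below shows one plausible interpretation, a relative tolerance of 128 epsilon units, and should be read as an assumption; approx_equal is a hypothetical helper.

```python
import sys

# Mirrors the value of MAX_FLOAT_EPSILON_UNITS_DISCREPANCY; how the library
# actually uses the constant is assumed here, not confirmed.
MAX_EPSILON_UNITS = 128


def approx_equal(a, b, units=MAX_EPSILON_UNITS):
    """Compare two floats with a relative tolerance of `units` machine epsilons."""
    if a is None or b is None:
        return None  # a null operand gives an undetermined result
    tolerance = units * sys.float_info.epsilon * max(abs(a), abs(b))
    return abs(a - b) <= tolerance


print(0.1 + 0.2 == 0.3)              # False: exact comparison sees the rounding error
print(approx_equal(0.1 + 0.2, 0.3))  # True: the difference is within tolerance
print(approx_equal(1.0, 1.0001))     # False: a real difference is still detected
```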

classmethod fully_equal(column1: DataColumn, column2: DataColumn, /, *, approximate_floats: bool = False, equal_nulls: bool = False, skip_nulls: bool = False) -> bool | None

Check if two DataColumns are fully equal.

Parameters:
  • column1 – The first column to compare.

  • column2 – The second column to compare.

  • approximate_floats (bool, optional) – Whether to compare floating-point values with a tolerance for rounding errors. Defaults to False.

  • equal_nulls (bool, optional) – Whether to treat null values as equal to each other. Defaults to False.

  • skip_nulls (bool, optional) – Whether to ignore null values during the comparison. Defaults to False.

Returns:

None if equal_nulls is False and the columns contain nulls; True if both columns are equal; False otherwise.

Return type:

bool | None

is_null() -> bool

Check if the underlying pyarrow.Array is a NullArray.

Returns:

True if the underlying pyarrow.Array is a NullArray, False otherwise.

Return type:

bool

classmethod load(data: Iterable | DataColumn, dtype: DataType = None) -> DataColumn

Wrap data (pyarrow.Array, pandas.Series, Iterable) in a new DataColumn object.

Parameters:
  • data – the data to be wrapped

  • dtype – the type of the underlying pyarrow.Array

Return type:

DataColumn

to_pandas() -> Series

Convert to a pandas.Series that is backed by PyArrow by means of ArrowExtensionArray.

Cf. https://pandas.pydata.org/docs/user_guide/pyarrow.html

Return type:

pandas.Series

to_pyarrow() -> Array

Return the underlying native pyarrow.Array object.

Return type:

pyarrow.Array

property type: DataType

Return the pyarrow.DataType of the underlying native pyarrow.Array.

Return type:

pyarrow.DataType