DataColumn#

The DataColumn class is the foundational abstraction for all tabular operations in Data Curator. It wraps a pyarrow.Array and adds element-wise operations, comparison logic, and composability for calculated columns.

It powers the calculated column system, Boolean logic in filters, type-safe arithmetic across datasets, and more.

Overview#

At its core, DataColumn:

  • Encapsulates a pyarrow.Array.

  • Enables arithmetic and comparison operations (+, -, ==, //, etc.).

  • Ensures null propagation and broadcasting consistency.

  • Supports composable transformations for use in custom calculations.

Basic Usage#

from kaxanuk.data_curator.modules.data_column import DataColumn

col_a = DataColumn.load([1, 2, 3])
col_b = DataColumn.load([10, 20, 30])

result = col_a + col_b   # Element-wise addition
filtered = col_a > 1     # Element-wise comparison returns boolean DataColumn

result.to_pandas()       # Export to pandas
result.to_pyarrow()      # Export to pyarrow

Arithmetic Operators#

You can apply arithmetic operations directly using standard Python syntax:

  • + (via __add__)

  • - (via __sub__)

  • * (via __mul__)

  • / (via __truediv__)

  • // (via __floordiv__)

  • % (via __mod__)

Reflected versions like 3 + col also work thanks to:

  • __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rmod__

All operations return a new DataColumn, with null-aware and type-safe behavior:

col = DataColumn.load([2, 4, 6])

col + 1         # [3, 5, 7]
col * 2         # [4, 8, 12]
10 - col        # [8, 6, 4]
col / 2         # [1.0, 2.0, 3.0]

Comparison Operators#

DataColumn supports element-wise comparison using standard Python syntax:

  • == (via __eq__)

  • != (via __ne__)

  • < (via __lt__)

  • <= (via __le__)

  • > (via __gt__)

  • >= (via __ge__)

Each of these returns a new DataColumn of boolean values:

col = DataColumn.load([5, 10, 15])

col > 7       # [False, True, True]
col == 10     # [False, True, False]
col <= col    # [True, True, True]

Logical Functions#

Combine multiple boolean columns using:

  • DataColumn.boolean_and(…)

  • DataColumn.boolean_or(…)

These work across multiple DataColumn objects, pyarrow scalars or booleans. Optionally enable allow_null_comparisons=True for Kleene logic.

DataColumn.boolean_and(col1 > 0, col2 < 100)
DataColumn.boolean_or(col1.is_null(), col2 == 5)

Equality Utilities#

To compare full columns for equality:

  • equal(…) — element-wise

  • fully_equal(…) — total match (returns True, False or None)

DataColumn.equal(col1, col2, equal_nulls=True)
DataColumn.fully_equal(col1, col2, skip_nulls=True)

String Concatenation#

Use concatenate to merge string-type columns or scalars:

DataColumn.concatenate(col1, col2, separator="-", null_replacement="N/A")

This returns a new DataColumn with joined string values.

Loading and Conversion#

You can create and convert DataColumn objects easily:

  • DataColumn.load(…) — from list, pandas.Series, or pyarrow.Array

  • .to_pandas() — convert to pandas.Series

  • .to_pyarrow() — convert to pyarrow.Array

  • .type — get native pyarrow type

  • .is_null() — detect if the array is fully null

Advanced Behavior#

Null-safe math and broadcasting are internally managed through helper methods:

  • _mask_dual_array_nulls(…)

  • _replace_array_mask_with_nones(…)

  • _return_null_column_on_null_operand(…)

These ensure safe and predictable behavior in pipelines, especially in user-defined calculations.

API Reference#