DataColumn#
The DataColumn class is the foundational abstraction for all tabular operations in Data Curator. It wraps a pyarrow.Array and adds element-wise operations, comparison logic, and composability for calculated columns.
It powers the calculated column system, Boolean logic in filters, and type-safe arithmetic across datasets.
Overview#
At its core, DataColumn:
Encapsulates a pyarrow.Array.
Enables arithmetic and comparison operations (+, -, ==, //, etc.).
Keeps null propagation and operand broadcasting consistent across operations.
Supports composable transformations for use in custom calculations.
Basic Usage#
from kaxanuk.data_curator.modules.data_column import DataColumn
col_a = DataColumn.load([1, 2, 3])
col_b = DataColumn.load([10, 20, 30])
result = col_a + col_b # Element-wise addition
is_above_one = col_a > 1 # Element-wise comparison; returns a boolean DataColumn, not a filtered column
result.to_pandas() # Export to pandas
result.to_pyarrow() # Export to pyarrow
Arithmetic Operators#
You can apply arithmetic operations directly using standard Python syntax:
+ (via __add__)
- (via __sub__)
* (via __mul__)
/ (via __truediv__)
// (via __floordiv__)
% (via __mod__)
Reflected versions like 3 + col also work thanks to:
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rmod__
All operations return a new DataColumn, with null-aware and type-safe behavior:
col = DataColumn.load([2, 4, 6])
col + 1 # [3, 5, 7]
col * 2 # [4, 8, 12]
10 - col # [8, 6, 4]
col / 2 # [1.0, 2.0, 3.0]
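The operator dispatch described above can be sketched with a toy class. This is not the DataColumn implementation: MiniColumn is a hypothetical stand-in backed by a plain Python list, and it only assumes that a null operand produces a null row and that scalars are broadcast to the column length.

```python
import operator


class MiniColumn:
    """Hypothetical stand-in for DataColumn, backed by a plain list."""

    def __init__(self, values):
        self.values = list(values)

    def _apply(self, other, op, reflected=False):
        # Broadcast a scalar operand to the column's length.
        if isinstance(other, MiniColumn):
            others = other.values
            if len(others) != len(self.values):
                raise ValueError("columns must have the same length")
        else:
            others = [other] * len(self.values)
        result = []
        for mine, theirs in zip(self.values, others):
            if mine is None or theirs is None:
                result.append(None)  # assumed rule: a null operand yields a null row
            elif reflected:
                result.append(op(theirs, mine))  # scalar was on the left-hand side
            else:
                result.append(op(mine, theirs))
        return MiniColumn(result)

    def __add__(self, other):
        return self._apply(other, operator.add)

    def __radd__(self, other):
        return self._apply(other, operator.add, reflected=True)

    def __sub__(self, other):
        return self._apply(other, operator.sub)

    def __rsub__(self, other):
        return self._apply(other, operator.sub, reflected=True)


col = MiniColumn([2, None, 6])
added = col + 1        # uses __add__
reflected = 10 - col   # int.__sub__ gives up, so Python calls __rsub__
doubled = col + col    # column-to-column operation
```

The reflected methods matter because Python first tries the left operand's method. For `10 - col`, the integer does not know how to subtract a column, so Python falls back to the column's `__rsub__`.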
Comparison Operators#
DataColumn supports element-wise comparison using standard Python syntax:
== (via __eq__)
!= (via __ne__)
< (via __lt__)
<= (via __le__)
> (via __gt__)
>= (via __ge__)
Each of these returns a new DataColumn of boolean values:
col = DataColumn.load([5, 10, 15])
col > 7 # [False, True, True]
col == 10 # [False, True, False]
col <= col # [True, True, True]
Logical Functions#
Combine multiple boolean columns using:
DataColumn.boolean_and(…)
DataColumn.boolean_or(…)
These accept any mix of DataColumn objects, pyarrow scalars, and Python booleans. By default, any row containing a null yields null; pass allow_null_comparisons=True to use Kleene logic instead.
DataColumn.boolean_and(col1 > 0, col2 < 100)
DataColumn.boolean_or(col1.is_null(), col2 == 5) # is_null() returns a single bool, not a per-row mask
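The difference between the default null handling and Kleene logic is easiest to see on single rows. The sketch below uses plain Python values, with None standing in for null. The four function names are hypothetical and are not part of DataColumn.

```python
def and_strict(*values):
    """Null-propagating AND: any null operand makes the result null."""
    if any(v is None for v in values):
        return None
    return all(values)


def and_kleene(*values):
    """Kleene AND: a definite False wins even when other operands are null."""
    if any(v is False for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True


def or_strict(*values):
    """Null-propagating OR: any null operand makes the result null."""
    if any(v is None for v in values):
        return None
    return any(values)


def or_kleene(*values):
    """Kleene OR: a definite True wins even when other operands are null."""
    if any(v is True for v in values):
        return True
    if any(v is None for v in values):
        return None
    return False


rows = [(True, True), (True, None), (False, None)]
strict_and = [and_strict(*row) for row in rows]
kleene_and = [and_kleene(*row) for row in rows]
strict_or = [or_strict(*row) for row in rows]
kleene_or = [or_kleene(*row) for row in rows]
```

Under Kleene logic, a null means "unknown", so `False AND unknown` is still False and `True OR unknown` is still True. The default mode gives up on any row that contains a null.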
Equality Utilities#
To compare full columns for equality:
equal(…) — element-wise
fully_equal(…) — total match (returns True, False or None)
DataColumn.equal(col1, col2, equal_nulls=True)
DataColumn.fully_equal(col1, col2, skip_nulls=True)
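As a rough model of how the two utilities relate, the sketch below works on plain lists with None as null. The function names are hypothetical, and the handling of a null compared against a non-null value when equal_nulls=True is an assumption (treated as not equal); the library may differ in such edge cases.

```python
def equal_rows(left, right, equal_nulls=False):
    """Element-wise equality; rows with nulls stay null unless equal_nulls is set."""
    result = []
    for a, b in zip(left, right):
        if a is None and b is None:
            result.append(True if equal_nulls else None)
        elif a is None or b is None:
            # Assumption: a null never equals a non-null value.
            result.append(False if equal_nulls else None)
        else:
            result.append(a == b)
    return result


def fully_equal_rows(left, right, equal_nulls=False, skip_nulls=False):
    """Whole-column equality returning True, False, or None."""
    if len(left) != len(right):
        return False
    rows = equal_rows(left, right, equal_nulls=equal_nulls)
    if skip_nulls:
        rows = [r for r in rows if r is not None]  # ignore undecidable rows
    if any(r is None for r in rows):
        return None  # nulls present and not treated as equal
    return all(rows)


left = [1, None, 3]
right = [1, None, 3]
elementwise = equal_rows(left, right)
elementwise_nulls_equal = equal_rows(left, right, equal_nulls=True)
undecided = fully_equal_rows(left, right)
nulls_equal = fully_equal_rows(left, right, equal_nulls=True)
nulls_skipped = fully_equal_rows(left, right, skip_nulls=True)
different = fully_equal_rows([1, 2], [1, 3])
```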
String Concatenation#
Use concatenate to merge string-type columns or scalars:
DataColumn.concatenate(col1, col2, separator="-", null_replacement="N/A")
This returns a new DataColumn with joined string values.
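The row-wise joining can be modelled on plain lists as follows. The helper name concatenate_rows is hypothetical, and the sketch assumes that each null input value is swapped for null_replacement before joining; the documented parameter description leaves open whether the library does exactly this.

```python
def concatenate_rows(*columns, separator="", null_replacement=""):
    """Join columns (lists) and scalars (str or None) row by row."""
    lengths = {len(c) for c in columns if isinstance(c, list)}
    if not lengths:
        # Only scalars were passed: the result is a single string.
        return separator.join(
            null_replacement if c is None else c for c in columns
        )
    if len(lengths) > 1:
        raise ValueError("columns must have the same length")
    (length,) = lengths
    # Broadcast scalars so every operand has one value per row.
    expanded = [c if isinstance(c, list) else [c] * length for c in columns]
    return [
        separator.join(null_replacement if v is None else v for v in row)
        for row in zip(*expanded)
    ]


joined = concatenate_rows(
    ["a", None], ["x", "y"], separator="-", null_replacement="N/A"
)
with_scalar = concatenate_rows(["a", "b"], "z", separator="-")
scalars_only = concatenate_rows("a", "b", separator="-")
```

The scalars-only case mirrors the documented behavior that concatenate returns a pyarrow.Scalar rather than a DataColumn when no column is passed.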
Loading and Conversion#
You can create and convert DataColumn objects easily:
DataColumn.load(…) — from list, pandas.Series, or pyarrow.Array
.to_pandas() — convert to pandas.Series
.to_pyarrow() — convert to pyarrow.Array
.type — get native pyarrow type
.is_null() — check whether the underlying array is a pyarrow NullArray (returns a single bool)
Advanced Behavior#
Null-safe math and broadcasting are internally managed through helper methods:
_mask_dual_array_nulls(…)
_replace_array_mask_with_nones(…)
_return_null_column_on_null_operand(…)
These ensure safe and predictable behavior in pipelines, especially in user-defined calculations.
API Reference#
- class DataColumn(array: Array, /)[source]#
Bases: object
- MAX_FLOAT_EPSILON_UNITS_DISCREPANCY = 128#
- classmethod boolean_and(*columns: DataColumn | Scalar | bool, allow_null_comparisons: bool = False) DataColumn[source]#
Perform a logical AND comparison on multiple DataColumns.
- Parameters:
*columns – The columns to be combined with boolean AND logic.
allow_null_comparisons – Whether to allow null comparisons with Kleene logic. Default is False, which outputs null on any row containing any null value.
- Return type:
A new DataColumn containing the result of the logical AND comparison.
- classmethod boolean_or(*columns: DataColumn | Scalar | bool, allow_null_comparisons: bool = False) DataColumn[source]#
Perform a logical OR comparison on multiple DataColumns.
- Parameters:
*columns – The columns to be combined with boolean OR logic.
allow_null_comparisons – Whether to allow null comparisons with Kleene logic. Default is False, which outputs null on any row containing any null value.
- Return type:
A new DataColumn containing the result of the logical OR comparison.
- classmethod concatenate(*columns: DataColumn | Scalar | str, null_replacement: str = '', separator: str = '') DataColumn | Scalar[source]#
Concatenate DataColumns into one DataColumn.
- Parameters:
*columns ('DataColumn' | pyarrow.Scalar | str) – The columns to be concatenated. Each column can be either a ‘DataColumn’ object, a pyarrow.Scalar, or a string.
null_replacement (str, optional) – The value to be used as replacement for null values in the concatenated result. Defaults to an empty string.
separator (str, optional) – The separator to be used between concatenated values. Defaults to an empty string.
- Returns:
A new DataColumn containing the concatenated rows of the input columns, or a pyarrow.Scalar if all columns were strings or scalars.
- Return type:
DataColumn | pyarrow.Scalar
- classmethod equal(column1: DataColumn, column2: DataColumn, /, *, approximate_floats: bool = False, equal_nulls: bool = False) DataColumn[source]#
Compare two DataColumns element-wise.
- Parameters:
column1 – The first column to compare.
column2 – The second column to compare.
equal_nulls – Specifies whether null values should be considered equal. Default is False.
approximate_floats – Specifies whether floating-point value equality should compensate for rounding errors. Default is False.
- Return type:
A DataColumn containing a pyarrow.BooleanArray indicating element-wise equality between the two columns.
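To see why approximate_floats exists, consider the sketch below. It is an assumption, not the library's algorithm: it guesses that MAX_FLOAT_EPSILON_UNITS_DISCREPANCY means "allow a relative difference of up to 128 machine epsilons". The function approximately_equal is hypothetical.

```python
import sys

# Assumed meaning of the class constant: tolerance in units of machine epsilon.
MAX_FLOAT_EPSILON_UNITS_DISCREPANCY = 128


def approximately_equal(a, b, max_units=MAX_FLOAT_EPSILON_UNITS_DISCREPANCY):
    """Compare two floats with a relative tolerance of max_units epsilons."""
    if a == b:
        return True
    tolerance = max_units * sys.float_info.epsilon * max(abs(a), abs(b))
    return abs(a - b) <= tolerance


exact = (0.1 + 0.2) == 0.3                    # rounding error breaks exact equality
approx = approximately_equal(0.1 + 0.2, 0.3)  # tolerated as rounding noise
far = approximately_equal(1.0, 1.0001)        # a real difference is still detected
```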
- classmethod fully_equal(column1: DataColumn, column2: DataColumn, /, *, approximate_floats: bool = False, equal_nulls: bool = False, skip_nulls: bool = False) bool | None[source]#
Check if two DataColumns are fully equal.
- Parameters:
column1 – The first column to compare.
column2 – The second column to compare.
approximate_floats (bool, optional) – Whether to consider floats as approximately equal. If True, floating-point comparison will use tolerance. If not specified, the default value is False.
equal_nulls (bool, optional) – Whether to consider null values as equal. If True, null values will be treated as equal. If not specified, the default value is False.
skip_nulls (bool, optional) – Whether to skip null values during comparison. If True, null values will be ignored. If not specified, the default value is False.
- Returns:
None if equal_nulls is False and either column contains nulls, True if both columns are equal, False otherwise.
- Return type:
bool | None
- is_null() bool[source]#
Check if the underlying pyarrow.Array is a NullArray.
- Returns:
Whether the underlying pyarrow.Array is a NullArray.
- Return type:
bool
- classmethod load(data: Iterable | DataColumn, dtype: DataType = None) DataColumn[source]#
Wrap data (pyarrow.Array, pandas.Series, Iterable) in a new DataColumn object.
- Parameters:
data – the data to be wrapped
dtype – the type of the underlying pyarrow.Array
- Return type:
DataColumn
- to_pandas() Series[source]#
Convert to a pandas.Series, forcing pandas to use PyArrow as the backend by means of ArrowExtensionArray.
Cf. https://pandas.pydata.org/docs/user_guide/pyarrow.html
- Return type:
pandas.Series
- to_pyarrow() Array[source]#
Return the underlying native pyarrow.Array object.
- Return type:
pyarrow.Array
- property type: DataType#
Return the type of the underlying native pyarrow.Array object.
- Return type:
pyarrow.DataType