pyarrow.ChunkedArray

class pyarrow.ChunkedArray

Bases: pyarrow.lib._PandasConvertible

An array-like composed from a (possibly empty) collection of pyarrow.Array objects.

Warning: Do not call this class’s constructor directly.
__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.
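Since the constructor should not be called directly, instances are typically created with the pyarrow.chunked_array() factory function. A minimal sketch with illustrative values:

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2, 3], [4, None]])   # two chunks, one null value
>>> arr.num_chunks
2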
Methods

__init__(*args, **kwargs)
    Initialize self.
cast(self, target_type[, safe])
    Cast array values to another data type.
chunk(self, i)
    Select a chunk by its index.
combine_chunks(self, MemoryPool memory_pool=None)
    Flatten this ChunkedArray into a single non-chunked array.
dictionary_encode(self)
    Compute dictionary-encoded representation of array.
equals(self, ChunkedArray other)
    Return whether the contents of two chunked arrays are equal.
fill_null(self, fill_value)
    See pyarrow.compute.fill_null docstring for usage.
filter(self, mask[, null_selection_behavior])
    Select values from a chunked array.
flatten(self, MemoryPool memory_pool=None)
    Flatten this ChunkedArray.
format(self, **kwargs)
is_null(self)
    Return BooleanArray indicating the null values.
is_valid(self)
    Return BooleanArray indicating the non-null values.
iterchunks(self)
length(self)
slice(self[, offset, length])
    Compute zero-copy slice of this ChunkedArray.
take(self, indices)
    Select values from a chunked array.
to_numpy(self)
    Return a NumPy copy of this array (experimental).
to_pandas(self[, memory_pool, categories, …])
    Convert to a pandas-compatible NumPy array or DataFrame, as appropriate.
to_pylist(self)
    Convert to a list of native Python objects.
to_string(self, int indent=0, int window=10)
    Render a “pretty-printed” string representation of the ChunkedArray.
unique(self)
    Compute distinct elements in array.
validate(self, *[, full])
    Perform validation checks.
value_counts(self)
    Compute counts of unique elements in array.
Attributes

chunks
data
nbytes
    Total number of bytes consumed by the elements of the chunked array.
null_count
    Number of null entries.
num_chunks
    Number of underlying chunks.
type
cast(self, target_type, safe=True)

Cast array values to another data type.

See pyarrow.compute.cast for usage.
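A short usage sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, None]])
>>> arr.cast(pa.float64())           # int64 -> double; nulls are preserved
>>> arr.cast(pa.int8(), safe=True)   # safe=True (the default) raises on lossy casts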
chunk(self, i)

Select a chunk by its index.

Parameters:
    i (int)

Returns:
    pyarrow.Array
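For example (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> arr.chunk(1)   # second chunk, a pyarrow.Array: [3, 4]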
chunks
combine_chunks(self, MemoryPool memory_pool=None)

Flatten this ChunkedArray into a single non-chunked array.

Parameters:
    memory_pool (MemoryPool, default None) – For memory allocations, if required, otherwise use default pool.

Returns:
    result (Array)
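A short sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> arr.combine_chunks()   # contents copied into a single contiguous array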
data
dictionary_encode(self)

Compute dictionary-encoded representation of array.

Returns:
    pyarrow.ChunkedArray – Same chunking as the input, all chunks share a common dictionary.
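For example (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b", "a"], ["b", "c"]])
>>> arr.dictionary_encode()   # chunks of dictionary type sharing one dictionary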
equals(self, ChunkedArray other)

Return whether the contents of two chunked arrays are equal.

Parameters:
    other (pyarrow.ChunkedArray) – Chunked array to compare against.

Returns:
    are_equal (bool)
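A sketch showing that equality is over logical contents, not chunk layout (illustrative values):

>>> import pyarrow as pa
>>> a = pa.chunked_array([[1, 2], [3]])
>>> b = pa.chunked_array([[1, 2, 3]])
>>> a.equals(b)   # True: same contents, different chunking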
fill_null(self, fill_value)

See pyarrow.compute.fill_null docstring for usage.
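A short usage sketch (illustrative values; the work is delegated to pyarrow.compute.fill_null):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, None], [None, 4]])
>>> arr.fill_null(0)   # nulls replaced by 0: [1, 0, 0, 4]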
filter(self, mask, null_selection_behavior='drop')

Select values from a chunked array. See pyarrow.compute.filter for full usage.
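A sketch of the two null_selection_behavior modes (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> mask = pa.array([True, False, None, True])
>>> arr.filter(mask)                                       # null in mask drops the value: [1, 4]
>>> arr.filter(mask, null_selection_behavior="emit_null")  # keeps it as null: [1, null, 4]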
flatten(self, MemoryPool memory_pool=None)

Flatten this ChunkedArray. If it has a struct type, the column is flattened into one array per struct field.

Parameters:
    memory_pool (MemoryPool, default None) – For memory allocations, if required, otherwise use default pool.

Returns:
    result (List[ChunkedArray])
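A sketch with a struct-typed column (illustrative field names and values):

>>> import pyarrow as pa
>>> struct = pa.array([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])
>>> col = pa.chunked_array([struct])
>>> col.flatten()   # one ChunkedArray per field: x (int64) and y (string)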
format(self, **kwargs)
is_null(self)

Return BooleanArray indicating the null values.
is_valid(self)

Return BooleanArray indicating the non-null values.
iterchunks(self)
length(self)
nbytes

Total number of bytes consumed by the elements of the chunked array.
null_count

Number of null entries.

Returns:
    int
num_chunks

Number of underlying chunks.

Returns:
    int
slice(self, offset=0, length=None)

Compute zero-copy slice of this ChunkedArray.

Parameters:
    offset (int, default 0) – Offset from start of array to slice.
    length (int, default None) – Length of slice (default is until end of batch starting from offset).

Returns:
    sliced (ChunkedArray)
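A short sketch (illustrative values; the slice may span chunk boundaries):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2, 3], [4, 5]])
>>> arr.slice(1, 3)   # zero-copy view: [2, 3, 4]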
take(self, indices)

Select values from a chunked array. See pyarrow.compute.take for full usage.
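A short usage sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b"], ["c", "d"]])
>>> arr.take(pa.array([3, 0, 0]))   # indices may repeat: ["d", "a", "a"]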
to_numpy(self)

Return a NumPy copy of this array (experimental).

Returns:
    array (numpy.ndarray)
to_pandas(self, memory_pool=None, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool timestamp_as_object=False, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False, bool safe=True, bool split_blocks=False, bool self_destruct=False, types_mapper=None)

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate.

Parameters:
    memory_pool (MemoryPool, default None) – Arrow MemoryPool to use for allocations. Uses the default memory pool if not passed.
    strings_to_categorical (bool, default False) – Encode string (UTF8) and binary types to pandas.Categorical.
    categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures.
    zero_copy_only (bool, default False) – Raise an ArrowException if this function call would require copying the underlying data.
    integer_object_nulls (bool, default False) – Cast integers with nulls to objects.
    date_as_object (bool, default True) – Cast dates to objects. If False, convert to datetime64[ns] dtype.
    timestamp_as_object (bool, default False) – Cast non-nanosecond timestamps (np.datetime64) to objects. This is useful if you have timestamps that don’t fit in the normal date range of nanosecond timestamps (1678 CE-2262 CE). If False, all timestamps are converted to datetime64[ns] dtype.
    use_threads (bool, default True) – Whether to parallelize the conversion using multiple threads.
    deduplicate_objects (bool, default True) – Do not create multiple copies of Python objects during conversion, to save on memory use. Conversion will be slower.
    ignore_metadata (bool, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present.
    safe (bool, default True) – For certain data types, a cast is needed in order to store the data in a pandas DataFrame or Series (e.g. timestamps are always stored as nanoseconds in pandas). This option controls whether it is a safe cast or not.
    split_blocks (bool, default False) – If True, generate one internal “block” for each column when creating a pandas.DataFrame from a RecordBatch or Table. While this can temporarily reduce memory, note that various pandas operations can trigger “consolidation” which may balloon memory use.
    self_destruct (bool, default False) – EXPERIMENTAL: If True, attempt to deallocate the originating Arrow memory while converting the Arrow object to pandas. If you use the object after calling to_pandas with this option, it will crash your program.
    types_mapper (function, default None) – A function mapping a pyarrow DataType to a pandas ExtensionDtype. This can be used to override the default pandas type for conversion of built-in pyarrow types or in absence of pandas_metadata in the Table schema. The function receives a pyarrow DataType and is expected to return a pandas ExtensionDtype, or None if the default conversion should be used for that type. If you have a dictionary mapping, you can pass dict.get as function.

Returns:
    pandas.Series or pandas.DataFrame depending on type of object
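A short sketch (illustrative values; a ChunkedArray converts to a pandas.Series):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, None], [3]])
>>> arr.to_pandas()                            # Series; null becomes NaN
>>> arr.to_pandas(integer_object_nulls=True)   # object dtype; null becomes None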
to_pylist(self)

Convert to a list of native Python objects.
to_string(self, int indent=0, int window=10)

Render a “pretty-printed” string representation of the ChunkedArray.
type
unique(self)

Compute distinct elements in array.

Returns:
    pyarrow.Array
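A short usage sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b"], ["a", "c"]])
>>> arr.unique()   # pyarrow.Array: ["a", "b", "c"]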
validate(self, *, full=False)

Perform validation checks. An exception is raised if validation fails.

By default only cheap validation checks are run. Pass full=True for thorough validation checks (potentially O(n)).

Parameters:
    full (bool, default False) – If True, run expensive checks, otherwise cheap checks only.

Raises:
    ArrowInvalid
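A short sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3]])
>>> arr.validate()           # cheap structural checks only
>>> arr.validate(full=True)  # O(n) checks; raises ArrowInvalid on failure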
value_counts(self)

Compute counts of unique elements in array.

Returns:
    An array of <input type “Values”, int64_t “Counts”> structs
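A short sketch (illustrative values; the result is a struct array pairing each distinct value with its count):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b"], ["a"]])
>>> arr.value_counts()   # e.g. values ["a", "b"] with counts [2, 1]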