pyarrow.ChunkedArray

class pyarrow.ChunkedArray

Bases: pyarrow.lib._PandasConvertible

An array-like composed from a (possibly empty) collection of pyarrow.Array objects.

Warning: Do not call this class’s constructor directly.
__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.
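Since the constructor should not be called directly, instances are typically created with the pyarrow.chunked_array() factory function. A minimal sketch with illustrative values:

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2, 3], [4, None]])   # two chunks, one null value
>>> arr.num_chunks
2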
Methods

__init__(*args, **kwargs)
    Initialize self.
cast(self, target_type[, safe])
    Cast array values to another data type.
chunk(self, i)
    Select a chunk by its index.
combine_chunks(self, MemoryPool memory_pool=None)
    Flatten this ChunkedArray into a single non-chunked array.
dictionary_encode(self)
    Compute dictionary-encoded representation of array.
equals(self, ChunkedArray other)
    Return whether the contents of two chunked arrays are equal.
fill_null(self, fill_value)
    See pyarrow.compute.fill_null docstring for usage.
filter(self, mask[, null_selection_behavior])
    Select values from a chunked array.
flatten(self, MemoryPool memory_pool=None)
    Flatten this ChunkedArray.
format(self, **kwargs)
is_null(self)
    Return BooleanArray indicating the null values.
is_valid(self)
    Return BooleanArray indicating the non-null values.
iterchunks(self)
length(self)
slice(self[, offset, length])
    Compute zero-copy slice of this ChunkedArray.
take(self, indices)
    Select values from a chunked array.
to_numpy(self)
    Return a NumPy copy of this array (experimental).
to_pandas(self[, memory_pool, categories, …])
    Convert to a pandas-compatible NumPy array or DataFrame, as appropriate.
to_pylist(self)
    Convert to a list of native Python objects.
to_string(self, int indent=0, int window=10)
    Render a “pretty-printed” string representation of the ChunkedArray.
unique(self)
    Compute distinct elements in array.
validate(self, *[, full])
    Perform validation checks.
value_counts(self)
    Compute counts of unique elements in array.
Attributes

chunks
data
nbytes
    Total number of bytes consumed by the elements of the chunked array.
null_count
    Number of null entries.
num_chunks
    Number of underlying chunks.
type
cast(self, target_type, safe=True)

Cast array values to another data type.

See pyarrow.compute.cast for usage.
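A short usage sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, None]])
>>> arr.cast(pa.float64())           # int64 -> double; nulls are preserved
>>> arr.cast(pa.int8(), safe=True)   # safe=True (the default) raises on lossy casts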
chunk(self, i)

Select a chunk by its index.

Parameters:
    i (int)

Returns:
    pyarrow.Array
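For example (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> arr.chunk(1)   # second chunk, a pyarrow.Array: [3, 4]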
chunks
combine_chunks(self, MemoryPool memory_pool=None)

Flatten this ChunkedArray into a single non-chunked array.

Parameters:
    memory_pool (MemoryPool, default None) – For memory allocations, if required, otherwise use default pool.

Returns:
    result (Array)
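A short sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> arr.combine_chunks()   # contents copied into a single contiguous array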
data
dictionary_encode(self)

Compute dictionary-encoded representation of array.

Returns:
    pyarrow.ChunkedArray – Same chunking as the input, all chunks share a common dictionary.
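For example (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b", "a"], ["b", "c"]])
>>> arr.dictionary_encode()   # chunks of dictionary type sharing one dictionary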
equals(self, ChunkedArray other)

Return whether the contents of two chunked arrays are equal.

Parameters:
    other (pyarrow.ChunkedArray) – Chunked array to compare against.

Returns:
    are_equal (bool)
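A sketch showing that equality is over logical contents, not chunk layout (illustrative values):

>>> import pyarrow as pa
>>> a = pa.chunked_array([[1, 2], [3]])
>>> b = pa.chunked_array([[1, 2, 3]])
>>> a.equals(b)   # True: same contents, different chunking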
fill_null(self, fill_value)

See pyarrow.compute.fill_null docstring for usage.
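A short usage sketch (illustrative values; the work is delegated to pyarrow.compute.fill_null):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, None], [None, 4]])
>>> arr.fill_null(0)   # nulls replaced by 0: [1, 0, 0, 4]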
filter(self, mask, null_selection_behavior='drop')

Select values from a chunked array. See pyarrow.compute.filter for full usage.
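A sketch of the two null_selection_behavior modes (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3, 4]])
>>> mask = pa.array([True, False, None, True])
>>> arr.filter(mask)                                       # null in mask drops the value: [1, 4]
>>> arr.filter(mask, null_selection_behavior="emit_null")  # keeps it as null: [1, null, 4]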
flatten(self, MemoryPool memory_pool=None)

Flatten this ChunkedArray. If it has a struct type, the column is flattened into one array per struct field.

Parameters:
    memory_pool (MemoryPool, default None) – For memory allocations, if required, otherwise use default pool.

Returns:
    result (List[ChunkedArray])
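A sketch with a struct-typed column (illustrative field names and values):

>>> import pyarrow as pa
>>> struct = pa.array([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])
>>> col = pa.chunked_array([struct])
>>> col.flatten()   # one ChunkedArray per field: x (int64) and y (string)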
format(self, **kwargs)
is_null(self)

Return BooleanArray indicating the null values.
is_valid(self)

Return BooleanArray indicating the non-null values.
iterchunks(self)
length(self)
nbytes

Total number of bytes consumed by the elements of the chunked array.
null_count

Number of null entries.

Returns:
    int
num_chunks

Number of underlying chunks.

Returns:
    int
slice(self, offset=0, length=None)

Compute zero-copy slice of this ChunkedArray.

Parameters:
    offset (int, default 0) – Offset from start of array to slice.
    length (int, default None) – Length of slice (default is until end of batch starting from offset).

Returns:
    sliced (ChunkedArray)
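A short sketch (illustrative values; the slice may span chunk boundaries):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2, 3], [4, 5]])
>>> arr.slice(1, 3)   # zero-copy view: [2, 3, 4]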
take(self, indices)

Select values from a chunked array. See pyarrow.compute.take for full usage.
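A short usage sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b"], ["c", "d"]])
>>> arr.take(pa.array([3, 0, 0]))   # indices may repeat: ["d", "a", "a"]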
to_numpy(self)

Return a NumPy copy of this array (experimental).

Returns:
    array (numpy.ndarray)
to_pandas(self, memory_pool=None, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool timestamp_as_object=False, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False, bool safe=True, bool split_blocks=False, bool self_destruct=False, types_mapper=None)

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate.

Parameters:
    memory_pool (MemoryPool, default None) – Arrow MemoryPool to use for allocations. Uses the default memory pool if not passed.
    strings_to_categorical (bool, default False) – Encode string (UTF8) and binary types to pandas.Categorical.
    categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures.
    zero_copy_only (bool, default False) – Raise an ArrowException if this function call would require copying the underlying data.
    integer_object_nulls (bool, default False) – Cast integers with nulls to objects.
    date_as_object (bool, default True) – Cast dates to objects. If False, convert to datetime64[ns] dtype.
    timestamp_as_object (bool, default False) – Cast non-nanosecond timestamps (np.datetime64) to objects. This is useful if you have timestamps that don’t fit in the normal date range of nanosecond timestamps (1678 CE-2262 CE). If False, all timestamps are converted to datetime64[ns] dtype.
    use_threads (bool, default True) – Whether to parallelize the conversion using multiple threads.
    deduplicate_objects (bool, default True) – Do not create multiple copies of Python objects during conversion, to save on memory use. Conversion will be slower.
    ignore_metadata (bool, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present.
    safe (bool, default True) – For certain data types, a cast is needed in order to store the data in a pandas DataFrame or Series (e.g. timestamps are always stored as nanoseconds in pandas). This option controls whether it is a safe cast or not.
    split_blocks (bool, default False) – If True, generate one internal “block” for each column when creating a pandas.DataFrame from a RecordBatch or Table. While this can temporarily reduce memory, note that various pandas operations can trigger “consolidation” which may balloon memory use.
    self_destruct (bool, default False) – EXPERIMENTAL: If True, attempt to deallocate the originating Arrow memory while converting the Arrow object to pandas. If you use the object after calling to_pandas with this option, it will crash your program.
    types_mapper (function, default None) – A function mapping a pyarrow DataType to a pandas ExtensionDtype. This can be used to override the default pandas type for conversion of built-in pyarrow types or in absence of pandas_metadata in the Table schema. The function receives a pyarrow DataType and is expected to return a pandas ExtensionDtype, or None if the default conversion should be used for that type. If you have a dictionary mapping, you can pass dict.get as function.

Returns:
    pandas.Series or pandas.DataFrame depending on type of object
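A short sketch (illustrative values; a ChunkedArray converts to a pandas.Series):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, None], [3]])
>>> arr.to_pandas()                            # Series; null becomes NaN
>>> arr.to_pandas(integer_object_nulls=True)   # object dtype; null becomes None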
to_pylist(self)

Convert to a list of native Python objects.
to_string(self, int indent=0, int window=10)

Render a “pretty-printed” string representation of the ChunkedArray.
type
unique(self)

Compute distinct elements in array.

Returns:
    pyarrow.Array
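A short usage sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b"], ["a", "c"]])
>>> arr.unique()   # pyarrow.Array: ["a", "b", "c"]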
validate(self, *, full=False)

Perform validation checks. An exception is raised if validation fails.

By default only cheap validation checks are run. Pass full=True for thorough validation checks (potentially O(n)).

Parameters:
    full (bool, default False) – If True, run expensive checks, otherwise cheap checks only.

Raises:
    ArrowInvalid
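A short sketch (illustrative values):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([[1, 2], [3]])
>>> arr.validate()           # cheap structural checks only
>>> arr.validate(full=True)  # O(n) checks; raises ArrowInvalid on failure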
value_counts(self)

Compute counts of unique elements in array.

Returns:
    An array of <input type “Values”, int64_t “Counts”> structs
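A short sketch (illustrative values; the result is a struct array pairing each distinct value with its count):

>>> import pyarrow as pa
>>> arr = pa.chunked_array([["a", "b"], ["a"]])
>>> arr.value_counts()   # e.g. values ["a", "b"] with counts [2, 1]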