pyarrow.dataset.FileSystemDataset

class pyarrow.dataset.FileSystemDataset
Bases: pyarrow._dataset.Dataset

A Dataset of file fragments.

A FileSystemDataset is composed of one or more FileFragment.
- Parameters
fragments (list[Fragment]) – List of fragments to consume.
schema (Schema) – The top-level schema of the Dataset.
format (FileFormat) – File format of the fragments; currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
filesystem (FileSystem) – FileSystem of the fragments.
root_partition (Expression, optional) – The top-level partition of the Dataset.
__init__(*args, **kwargs)
Initialize self. See help(type(self)) for accurate signature.
Methods

__init__(*args, **kwargs) – Initialize self.
from_paths() – A Dataset created from a list of paths on a particular filesystem.
get_fragments() – Returns an iterator over the fragments in this dataset.
replace_schema() – Return a copy of this Dataset with a different schema.
scan() – Builds a scan operation against the dataset.
to_batches() – Read the dataset as materialized record batches.
to_table() – Read the dataset to an Arrow table.
Attributes

files – List of the files.
filesystem
format – The FileFormat of this source.
partition_expression – An Expression which evaluates to true for all data viewed by this Dataset.
schema – The common schema of the full Dataset.
files
List of the files.
filesystem
The FileSystem of the fragments.
format
The FileFormat of this source.
from_paths()
A Dataset created from a list of paths on a particular filesystem.
- Parameters
paths (list of str) – List of file paths to create the fragments from.
schema (Schema) – The top-level schema of the Dataset.
format (FileFormat) – File format to create fragments from; currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
filesystem (FileSystem) – The filesystem which files are from.
partitions (list[Expression], optional) – Attach additional partition information for the file paths.
root_partition (Expression, optional) – The top-level partition of the Dataset.
get_fragments()
Returns an iterator over the fragments in this dataset.
- Parameters
filter (Expression, default None) – Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.
- Returns
fragments (iterator of Fragment)
partition_expression
An Expression which evaluates to true for all data viewed by this Dataset.
replace_schema()
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
scan()
Builds a scan operation against the dataset.

It produces a stream of ScanTasks, each of which is meant to be a unit of work to be dispatched. The tasks are not executed automatically; the user is responsible for executing and dispatching the individual tasks, so custom local task scheduling can be implemented.
- Parameters
columns (list of str, default None) – List of columns to project. Order and duplicates will be preserved. The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
filter (Expression, default None) – Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
batch_size (int, default 1M) – The maximum row count for scanned record batches. If scanned record batches are overflowing memory, this value can be reduced to decrease their size.
use_threads (bool, default True) – If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
memory_pool (MemoryPool, default None) – For memory allocations, if required. If not specified, uses the default pool.
- Returns
scan_tasks (iterator of ScanTask)
schema
The common schema of the full Dataset.
to_batches()
Read the dataset as materialized record batches.
Builds a scan operation against the dataset and sequentially executes the ScanTasks as the returned generator gets consumed.
See scan method parameters documentation.
- Returns
record_batches (iterator of RecordBatch)
to_table()
Read the dataset to an Arrow table.
Note that this method reads all the selected data from the dataset into memory.
See scan method parameters documentation.
- Returns
table (Table instance)