pyarrow.parquet.write_to_dataset

pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=None, partition_filename_cb=None, filesystem=None, use_legacy_dataset=True, **kwargs)

Wrapper around parquet.write_table for writing a Table to Parquet format by partitions. For each combination of partition columns and values, a subdirectory is created in the following manner:
root_dir/
  group1=value1
    group2=value1
      <uuid>.parquet
    group2=value2
      <uuid>.parquet
  group1=valueN
    group2=value1
      <uuid>.parquet
    group2=valueN
      <uuid>.parquet
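A minimal usage sketch of such a partitioned write (the table contents, column names, and root path below are illustrative, not part of the API):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical example table; "group1" and "group2" will become the
    # partition columns in the directory layout shown above.
    table = pa.table({
        "group1": ["value1", "value1", "valueN"],
        "group2": ["value1", "value2", "value1"],
        "measurement": [1.0, 2.0, 3.0],
    })

    # Creates root_dir/group1=.../group2=.../<uuid>.parquet for each
    # combination of partition-column values.
    pq.write_to_dataset(table, root_path="root_dir",
                        partition_cols=["group1", "group2"])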
- Parameters
table (pyarrow.Table) – The table to write.
root_path (str, pathlib.Path) – The root directory of the dataset.
filesystem (FileSystem, default None) – If nothing is passed, paths are assumed to be found in the local on-disk filesystem.
partition_cols (list) – Column names by which to partition the dataset. Columns are partitioned in the order they are given.
partition_filename_cb (callable) – A callback function that takes the partition key(s) as an argument and allows you to override the partition filename. If nothing is passed, the filename will consist of a uuid.
use_legacy_dataset (bool, default True) – Set to False to enable the new code path (experimental, using the new Arrow Dataset API). This is more efficient when using partition columns, but does not (yet) support partition_filename_cb and metadata_collector keywords.
**kwargs (dict) – Additional kwargs for the write_table function. See the docstring for write_table or ParquetWriter for more information. Using metadata_collector in kwargs allows one to collect the file metadata instances of dataset pieces. The file paths in the ColumnChunkMetaData will be set relative to root_path.
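As a sketch of the metadata_collector pattern described above (reusing the hypothetical table from the earlier example):

    import pyarrow.parquet as pq

    # Pass a list as metadata_collector; one FileMetaData instance is
    # appended per written dataset piece, with file paths set relative
    # to root_path.
    metadata_collector = []
    pq.write_to_dataset(table, root_path="root_dir",
                        partition_cols=["group1"],
                        metadata_collector=metadata_collector)

    # Inspect where each piece landed, e.g. "group1=value1/<uuid>.parquet".
    for file_metadata in metadata_collector:
        print(file_metadata.row_group(0).column(0).file_path)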