pyarrow.dataset.partitioning
pyarrow.dataset.partitioning(schema=None, field_names=None, flavor=None, dictionaries=None)
Specify a partitioning scheme.
The supported schemes include:
“DirectoryPartitioning”: this scheme expects one segment in the file path for each field in the specified schema (all fields are required to be present). For example, given schema<year:int16, month:int8>, the path “/2009/11” would be parsed to the expression (year == 2009 and month == 11).
“HivePartitioning”: a scheme for “/$key=$value/” nested directories as found in Apache Hive. This is a multi-level, directory based partitioning scheme. Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names. For example, given schema<year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15” (but the field order does not need to match).
- Parameters
schema (pyarrow.Schema, default None) – The schema that describes the partitions present in the file path. If not specified, and field_names and/or flavor are specified, the schema will be inferred from the file path (and a PartitioningFactory is returned).
field_names (list of str, default None) – A list of strings (field names). If specified, the schema’s types are inferred from the file paths (only valid for DirectoryPartitioning).
flavor (str, default None) – The default is DirectoryPartitioning. Specify flavor="hive" for a HivePartitioning.
dictionaries (List[Array]) – If the type of any field of schema is a dictionary type, the corresponding entry of dictionaries must be an array containing every value which may be taken by the corresponding column, or an error will be raised in parsing.
- Returns
Partitioning or PartitioningFactory
Examples
Specify the Schema for paths like “/2009/June”:
>>> import pyarrow as pa
>>> from pyarrow.dataset import partitioning
>>> partitioning(pa.schema([("year", pa.int16()), ("month", pa.string())]))
or let the types be inferred by only specifying the field names:
>>> partitioning(field_names=["year", "month"])
For paths like “/2009/June”, the year will be inferred as int32 while month will be inferred as string.
Create a Hive scheme for a path like “/year=2009/month=11”:
>>> partitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]),
...     flavor="hive")
A Hive scheme can also be discovered from the directory structure (and types will be inferred):
>>> partitioning(flavor="hive")