Python bindings

This is the documentation of the Python API of Apache Arrow. For more details on the Arrow format and other language bindings see the parent documentation.

The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.

Here we will detail the usage of the Python API for Arrow and the leaf libraries that add additional functionality such as reading Apache Parquet files into Arrow structures.

Installing PyArrow
- System Compatibility
- Python Compatibility
- Using Conda
- Using Pip
- Installing from source
- Installing Nightly Packages
Memory and IO Interfaces
- Referencing and Allocating Memory
- Input and Output
Data Types and In-Memory Data Model
- Type Metadata
- Schemas
- Arrays
- Record Batches
- Tables
- Custom Schema and Field Metadata
Compute Functions
Streaming, Serialization, and IPC
- Writing and Reading Streams
- Arbitrary Object Serialization
Filesystem Interface
- Usage
- S3
- Hadoop File System (HDFS)
- Using fsspec-compatible filesystems
Filesystem Interface (legacy)
- Hadoop File System (HDFS)
The Plasma In-Memory Object Store
- The Plasma API
- Using Arrow and Pandas with Plasma
- Using Plasma with Huge Pages
NumPy Integration
- NumPy to Arrow
- Arrow to NumPy
Pandas Integration
- DataFrames
- Series
- Handling pandas Indexes
- Type differences
- Memory Usage and Zero Copy
Timestamps
- Arrow/Pandas Timestamps
- Timestamp Conversions
Reading CSV files
- Usage
- Customized parsing
- Customized conversion
- Incremental reading
- Character encoding
- Performance
Feather File Format
- Using Compression
- Writing Version 1 (V1) Files
Reading JSON files
- Usage
- Automatic Type Inference
- Customized parsing
Reading and Writing the Apache Parquet Format
- Obtaining pyarrow with Parquet Support
- Reading and Writing Single Files
- Finer-grained Reading and Writing
- Inspecting the Parquet File Metadata
- Data Type Handling
- Compression, Encoding, and File Compatibility
- Partitioned Datasets (Multiple Files)
- Writing to Partitioned Datasets
- Reading from Partitioned Datasets
- Using with Spark
- Multithreaded Reads
- Reading a Parquet File from Azure Blob storage
Tabular Datasets
- Reading Datasets
- Filtering data
- Reading partitioned data
- Reading from cloud storage
- Reading from Minio
- Working with Parquet Datasets
- Manual specification of the Dataset
- Manual scheduling
CUDA Integration
- CUDA Contexts
- CUDA Buffers
- Numba Integration
Extending pyarrow
- Controlling conversion to pyarrow.Array with the __arrow_array__ protocol
- Defining extension types (“user-defined types”)
Using pyarrow from C++ and Cython Code
- C++ API
- Cython API
API Reference
- Data Types and Schemas
- Arrays and Scalars
- Buffers and Memory
- Compute Functions
- Streams and File Access
- Tables and Tensors
- Serialization and IPC
- Arrow Flight
- Tabular File Formats
- Filesystems
- Dataset
- Plasma In-Memory Object Store
- CUDA Integration
- Miscellaneous
Getting Involved
Benchmarks
- Running the benchmarks
- Running for arbitrary Git revisions
- Compatibility
