Apache Arrow
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
The project is developing a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes such topics as:
Zero-copy shared memory and RPC-based data movement
Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet)
In-memory analytics and query processing
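To make the idea of a columnar layout concrete, here is a conceptual sketch in plain Python. It does not use the Arrow libraries and does not reproduce Arrow's exact buffer layout. It only illustrates the general approach of storing each field as one contiguous value buffer alongside a validity bitmap that marks missing entries. The helper names `to_columns` and `column_sum` and the sample records are hypothetical and chosen for illustration.

```python
from array import array

# Row-oriented layout: one Python object (here a dict) per record.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": None},  # missing value
    {"id": 3, "price": 4.0},
]


def to_columns(records, name, typecode, fill):
    """Pack one field into a contiguous value buffer plus a validity bitmap.

    Hypothetical helper: bit i of the returned integer is set when
    slot i of the buffer holds a real (non-missing) value.
    """
    values = array(typecode)
    validity = 0
    for i, record in enumerate(records):
        value = record[name]
        if value is None:
            values.append(fill)  # placeholder slot, masked out by the bitmap
        else:
            values.append(value)
            validity |= 1 << i
    return values, validity


def column_sum(values, validity):
    """Sum only the valid slots while scanning the buffer once."""
    return sum(v for i, v in enumerate(values) if (validity >> i) & 1)


# Column-oriented layout: one typed buffer per field.
ids, ids_valid = to_columns(rows, "id", "q", 0)
prices, prices_valid = to_columns(rows, "price", "d", 0.0)

total = column_sum(prices, prices_valid)
print(list(ids), list(prices), bin(prices_valid), total)
```

An aggregation over one field touches only that field's buffer, which is the property that makes columnar layouts a good fit for analytic scans. The Arrow Columnar Format listed under Specifications and Protocols defines the actual buffers and their alignment.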
- Specifications and Protocols
- Format Versioning and Stability
- Arrow Columnar Format
- Arrow Flight RPC
- Integration Testing
- The Arrow C data interface
- The Arrow C stream interface
- Other Data Structures
- Libraries
- Implementation Status
- C/GLib
- C++
- User Guide
- High-Level Overview
- Conventions
- Using Arrow C++ in your own project
- Memory Management
- Arrays
- Data Types
- Tabular Data
- Compute Functions
- Input / output and filesystems
- Reading and writing the Arrow IPC format
- Reading and writing Parquet files
- Reading CSV files
- Reading JSON files
- Arrow Flight RPC
- Examples
- API Reference
- C#
- Go
- Java
- JavaScript
- Julia
- MATLAB
- Python
- Installing PyArrow
- Memory and IO Interfaces
- Data Types and In-Memory Data Model
- Compute Functions
- Streaming, Serialization, and IPC
- Filesystem Interface
- Filesystem Interface (legacy)
- Hadoop File System (HDFS)
- HDFS API
- pyarrow.hdfs.connect
- pyarrow.HadoopFileSystem.cat
- pyarrow.HadoopFileSystem.chmod
- pyarrow.HadoopFileSystem.chown
- pyarrow.HadoopFileSystem.delete
- pyarrow.HadoopFileSystem.df
- pyarrow.HadoopFileSystem.disk_usage
- pyarrow.HadoopFileSystem.download
- pyarrow.HadoopFileSystem.exists
- pyarrow.HadoopFileSystem.get_capacity
- pyarrow.HadoopFileSystem.get_space_used
- pyarrow.HadoopFileSystem.info
- pyarrow.HadoopFileSystem.ls
- pyarrow.HadoopFileSystem.mkdir
- pyarrow.HadoopFileSystem.open
- pyarrow.HadoopFileSystem.rename
- pyarrow.HadoopFileSystem.rm
- pyarrow.HadoopFileSystem.upload
- pyarrow.HdfsFile
- The Plasma In-Memory Object Store
- NumPy Integration
- Pandas Integration
- Timestamps
- Reading CSV files
- Feather File Format
- Reading JSON files
- Reading and Writing the Apache Parquet Format
- Obtaining pyarrow with Parquet Support
- Reading and Writing Single Files
- Finer-grained Reading and Writing
- Inspecting the Parquet File Metadata
- Data Type Handling
- Compression, Encoding, and File Compatibility
- Partitioned Datasets (Multiple Files)
- Writing to Partitioned Datasets
- Reading from Partitioned Datasets
- Using with Spark
- Multithreaded Reads
- Reading a Parquet File from Azure Blob storage
- Tabular Datasets
- CUDA Integration
- Extending pyarrow
- Using pyarrow from C++ and Cython Code
- API Reference
- Data Types and Schemas
- Factory Functions
- pyarrow.null
- pyarrow.bool_
- pyarrow.int8
- pyarrow.int16
- pyarrow.int32
- pyarrow.int64
- pyarrow.uint8
- pyarrow.uint16
- pyarrow.uint32
- pyarrow.uint64
- pyarrow.float16
- pyarrow.float32
- pyarrow.float64
- pyarrow.time32
- pyarrow.time64
- pyarrow.timestamp
- pyarrow.date32
- pyarrow.date64
- pyarrow.binary
- pyarrow.string
- pyarrow.utf8
- pyarrow.large_binary
- pyarrow.large_string
- pyarrow.large_utf8
- pyarrow.decimal128
- pyarrow.list_
- pyarrow.large_list
- pyarrow.map_
- pyarrow.struct
- pyarrow.dictionary
- pyarrow.field
- pyarrow.schema
- pyarrow.from_numpy_dtype
- Type Classes
- pyarrow.DataType
- pyarrow.DictionaryType
- pyarrow.ListType
- pyarrow.MapType
- pyarrow.StructType
- pyarrow.UnionType
- pyarrow.TimestampType
- pyarrow.Time32Type
- pyarrow.Time64Type
- pyarrow.FixedSizeBinaryType
- pyarrow.Decimal128Type
- pyarrow.Field
- pyarrow.Schema
- pyarrow.ExtensionType
- pyarrow.PyExtensionType
- pyarrow.register_extension_type
- pyarrow.unregister_extension_type
- Type Checking
- pyarrow.types.is_boolean
- pyarrow.types.is_integer
- pyarrow.types.is_signed_integer
- pyarrow.types.is_unsigned_integer
- pyarrow.types.is_int8
- pyarrow.types.is_int16
- pyarrow.types.is_int32
- pyarrow.types.is_int64
- pyarrow.types.is_uint8
- pyarrow.types.is_uint16
- pyarrow.types.is_uint32
- pyarrow.types.is_uint64
- pyarrow.types.is_floating
- pyarrow.types.is_float16
- pyarrow.types.is_float32
- pyarrow.types.is_float64
- pyarrow.types.is_decimal
- pyarrow.types.is_list
- pyarrow.types.is_large_list
- pyarrow.types.is_struct
- pyarrow.types.is_union
- pyarrow.types.is_nested
- pyarrow.types.is_temporal
- pyarrow.types.is_timestamp
- pyarrow.types.is_date
- pyarrow.types.is_date32
- pyarrow.types.is_date64
- pyarrow.types.is_time
- pyarrow.types.is_time32
- pyarrow.types.is_time64
- pyarrow.types.is_null
- pyarrow.types.is_binary
- pyarrow.types.is_unicode
- pyarrow.types.is_string
- pyarrow.types.is_large_binary
- pyarrow.types.is_large_unicode
- pyarrow.types.is_large_string
- pyarrow.types.is_fixed_size_binary
- pyarrow.types.is_map
- pyarrow.types.is_dictionary
- Arrays and Scalars
- Factory Functions
- Array Types
- pyarrow.Array
- pyarrow.BooleanArray
- pyarrow.FloatingPointArray
- pyarrow.IntegerArray
- pyarrow.Int8Array
- pyarrow.Int16Array
- pyarrow.Int32Array
- pyarrow.Int64Array
- pyarrow.NullArray
- pyarrow.NumericArray
- pyarrow.UInt8Array
- pyarrow.UInt16Array
- pyarrow.UInt32Array
- pyarrow.UInt64Array
- pyarrow.BinaryArray
- pyarrow.StringArray
- pyarrow.FixedSizeBinaryArray
- pyarrow.LargeBinaryArray
- pyarrow.LargeStringArray
- pyarrow.Time32Array
- pyarrow.Time64Array
- pyarrow.Date32Array
- pyarrow.Date64Array
- pyarrow.TimestampArray
- pyarrow.Decimal128Array
- pyarrow.DictionaryArray
- pyarrow.ListArray
- pyarrow.LargeListArray
- pyarrow.StructArray
- pyarrow.UnionArray
- pyarrow.ExtensionArray
- Scalars
- pyarrow.scalar
- pyarrow.NA
- pyarrow.Scalar
- pyarrow.BooleanScalar
- pyarrow.Int8Scalar
- pyarrow.Int16Scalar
- pyarrow.Int32Scalar
- pyarrow.Int64Scalar
- pyarrow.UInt8Scalar
- pyarrow.UInt16Scalar
- pyarrow.UInt32Scalar
- pyarrow.UInt64Scalar
- pyarrow.FloatScalar
- pyarrow.DoubleScalar
- pyarrow.BinaryScalar
- pyarrow.StringScalar
- pyarrow.FixedSizeBinaryScalar
- pyarrow.LargeBinaryScalar
- pyarrow.LargeStringScalar
- pyarrow.Time32Scalar
- pyarrow.Time64Scalar
- pyarrow.Date32Scalar
- pyarrow.Date64Scalar
- pyarrow.TimestampScalar
- pyarrow.Decimal128Scalar
- pyarrow.DictionaryScalar
- pyarrow.ListScalar
- pyarrow.LargeListScalar
- pyarrow.StructScalar
- pyarrow.UnionScalar
- Buffers and Memory
- Compute Functions
- Aggregations
- Arithmetic Functions
- Comparisons
- Logical Functions
- String Predicates
- pyarrow.compute.ascii_is_alnum
- pyarrow.compute.ascii_is_alpha
- pyarrow.compute.ascii_is_decimal
- pyarrow.compute.ascii_is_lower
- pyarrow.compute.ascii_is_printable
- pyarrow.compute.ascii_is_space
- pyarrow.compute.ascii_is_upper
- pyarrow.compute.utf8_is_alnum
- pyarrow.compute.utf8_is_alpha
- pyarrow.compute.utf8_is_decimal
- pyarrow.compute.utf8_is_digit
- pyarrow.compute.utf8_is_lower
- pyarrow.compute.utf8_is_numeric
- pyarrow.compute.utf8_is_printable
- pyarrow.compute.utf8_is_space
- pyarrow.compute.utf8_is_upper
- pyarrow.compute.ascii_is_title
- pyarrow.compute.utf8_is_title
- pyarrow.compute.string_is_ascii
- String Transforms
- Containment tests
- Conversions
- Selections
- Associative transforms
- Sorts and partitions
- Structural Transforms
- Streams and File Access
- Tables and Tensors
- Serialization and IPC
- Inter-Process Communication
- pyarrow.ipc.new_file
- pyarrow.ipc.open_file
- pyarrow.ipc.new_stream
- pyarrow.ipc.open_stream
- pyarrow.ipc.read_message
- pyarrow.ipc.read_record_batch
- pyarrow.ipc.get_record_batch_size
- pyarrow.ipc.read_tensor
- pyarrow.ipc.write_tensor
- pyarrow.ipc.get_tensor_size
- pyarrow.ipc.Message
- pyarrow.ipc.MessageReader
- pyarrow.ipc.RecordBatchFileReader
- pyarrow.ipc.RecordBatchFileWriter
- pyarrow.ipc.RecordBatchStreamReader
- pyarrow.ipc.RecordBatchStreamWriter
- Serialization
- Arrow Flight
- Tabular File Formats
- Filesystems
- Dataset
- Factory functions
- Classes
- pyarrow.dataset.FileFormat
- pyarrow.dataset.ParquetFileFormat
- pyarrow.dataset.Partitioning
- pyarrow.dataset.PartitioningFactory
- pyarrow.dataset.DirectoryPartitioning
- pyarrow.dataset.HivePartitioning
- pyarrow.dataset.Dataset
- pyarrow.dataset.FileSystemDataset
- pyarrow.dataset.FileSystemFactoryOptions
- pyarrow.dataset.FileSystemDatasetFactory
- pyarrow.dataset.UnionDataset
- pyarrow.dataset.Scanner
- pyarrow.dataset.Expression
- Plasma In-Memory Object Store
- CUDA Integration
- Miscellaneous
- Getting Involved
- Benchmarks
- R
- Ruby
- Rust
- Development
- Contributing to Apache Arrow
- C++ Development
- Python Development
- Daily Development using Archery
- Packaging and Testing with Crossbow
- Running Docker Builds
- Benchmarks
- Building the Documentation