class: center, middle

# Apache Arrow

## A cross-language development platform for in-memory data

Joris Van den Bossche, EuroScipy, September 4, 2019

https://github.com/jorisvandenbossche/talks/
[@jorisvdbossche](https://twitter.com/jorisvdbossche)

---

# About me

Joris Van den Bossche

- Background: PhD bio-science engineer, air quality research
- Open source enthusiast: pandas core dev, geopandas maintainer, scikit-learn contributor
- Currently working part-time at Ursa Labs on Apache Arrow

https://github.com/jorisvandenbossche
Twitter: [@jorisvdbossche](https://twitter.com/jorisvdbossche)
.center[
.affiliations[
]
]

---
class: center, middle

---

.center[
]

* Open source initiative conceived in 2015
* Intersection of database systems, big data, and data science tools
* Purpose: Cross-language open standards and libraries to accelerate and simplify in-memory computing
* https://github.com/apache/arrow

---
count: false

.center[
]

* Open source initiative conceived in 2015
* Intersection of database systems, big data, and data science tools
* Purpose: Cross-language **open standards** and libraries to accelerate and simplify in-memory computing
* https://github.com/apache/arrow

---
class: middle, center

# Open standards

???
give some examples:

* Human-readable semi-structured data: XML, JSON
* Structured data query language: SQL
* Binary storage formats (with metadata)
  * NetCDF
  * HDF5
  * Apache Parquet, ORC
* Serialization / RPC protocols
  * Apache Avro
  * Protocol buffers
* Not an open standard: Excel, CSV (grrrr)

Open standards: why do they matter?

* Simplify system architectures
* Reduce ecosystem fragmentation
* Improve interoperability
* Reuse more libraries and algorithms

---

# Standardized in-memory data

--
count: false

* Example: numpy - strided ndarray memory layout

???
numpy's strided memory layout -> of course not limited to numpy, it is more universal, but it still is a standard

no-copy sharing of data (memoryviews), buffer protocol

--
count: false

* Why?
  * Zero-overhead memory sharing between libraries (in memory) and between processes (via shared memory), without serialization
  * Reuse computational algorithms
  * Reuse IO / storage code

???
Why create open standards for in-memory data?

* Reuse computational algorithms
* Reuse IO / file format interfaces
* Move data structures without serializing
* Zero-copy shared memory access
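---

# Standardized in-memory data: a small example

A minimal sketch (not from the original talk) of the zero-copy sharing that numpy's buffer protocol enables: two consumers looking at the same memory without any serialization. The array contents are made up for illustration.

```python
import numpy as np

a = np.arange(10, dtype="float64")   # one library allocates the memory

# Another consumer wraps the *same* buffer via the buffer protocol: no copy is made.
b = np.frombuffer(a, dtype="float64")

a[0] = 42.0
print(b[0])                          # 42.0 -- both views share one memory region
print(np.shares_memory(a, b))        # True
```

Arrow aims to provide the same kind of standard, but for columnar / tabular data.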
---

# Columnar data - "data frames"

* Notoriously not based on open standards

--
count: false

* Python (pandas DataFrame), R (base data.frame, data.table), Julia, Spark, SQL databases, ...

???
Where are they found?

* Internals of SQL databases
* Big data systems (Apache Spark, Apache Hive)
* In-memory data frame libraries: Python (pandas), R (base, data.table), Julia (DataFrames.jl)

--
count: false

→ Each has its own "proprietary" memory layout

--
count: false

→ Little code reuse across projects + high serialization costs

---

# Traditional analytic database / dataframe

.left-column[
]

--
count: false

.right-column[
.small[
* Tightly coupled components
* Incompatible memory representations
* Cannot share data without serialization
* Cannot share algorithms because implementation depends on memory representation
]
]

???
Pandas has a lot of code for IO, has its own "block manager" memory layout, has a lot of code to perform operations on this, and its own API. All this is very tightly integrated.
For example, pandas has quite some code written in Cython for optimizing the groupby operations. But currently that is very hard for someone else to reuse.
And R has all this of its own, each database has all this of its own, ...

* Incompatible memory representations
* Cannot share data without serialization
* Cannot share algorithms because implementation depends on memory representation

---
class: center

--
count: false

↓
 --- # "Deconstructed" database .left-column[  ] .right-column[ * Components have public APIs * Use what you need * Different front ends can be developed ] --- # "Deconstructed" database .left-column[  ] .right-column[ .larger[ **Arrow is front end agnostic** ] ] --- .center[  ] * Open standards: * Standardized language-independent columnar memory layout * Messaging protocol for serialization and interprocess communication (IPC) --- ## Arrow's Columnar Memory Format .center[] * More native data types: string, date, decimal, nested data, ... * Everyting is nullable * Memory can be chunked * Optimized layout for efficient access and processing --- .center[  ] * Open standards: * Standardized language-independent columnar memory layout * Messaging protocol for serialization and interprocess communication (IPC) * Libraries: * Implementations of those specifications in many languages (C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby and Rust) * Computational tools based on those standards --- # Python library: `pyarrow` * Conversion utilities for Python <-> Arrow * Interface to Arrow functionalities .medium[ ```python df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]}) import pyarrow as pa table = pa.Table.from_pandas(df) ``` ] --- # Arrow use cases * Data access * Data movement * Runtime data structures for analytics --- class: center, middle # Example use cases --- # Arrow use cases * **Data access** * Read and write widely used storage formats * Interact with database protocols, other data sources * Data movement * Runtime data structures for analytics --- # Apache Parquet * Standardized, binary, columnar storage format * Originally from the Hadoop ecosystem -- count: false Reading and writing Parquet files with pandas (powered by implementation in Apache Arrow): ```python import pandas as pd df = pd.read_parquet("dataset.parquet") ``` - High performance - High compression - Query the data you need - Preserves types ??? compared to csv: faster reading / writing, smaller file on disk, preserving data types but also non-python specific (eg compared to pickle) --- # Apache Parquet Python: ```python df = pd.read_parquet("dataset.parquet") ``` R: ```r df <- read_parquet("dataset.parquet") ``` MATLAB: ```python T = parquetread("dataset.parquet") ``` All using the same Arrow implementation as pandas! --- # Turbodbc  * Access relational databases via the Open Database Connectivity (ODBC) interface * Built-in Arrow support: batched data transfer instead of single-record communication as other popular ODBC modules do. https://github.com/blue-yonder/turbodbc --- # Other data access examples * Supported formats in Apache Arrow: Parquet, CSV, line-delimited JSON (more coming) * [kartothek](https://kartothek.readthedocs.io/en/latest/): managing dataset collections
---

# Turbodbc

* Access relational databases via the Open Database Connectivity (ODBC) interface
* Built-in Arrow support: batched data transfer instead of the single-record communication that other popular ODBC modules use

https://github.com/blue-yonder/turbodbc

---

# Other data access examples

* Supported formats in Apache Arrow: Parquet, CSV, line-delimited JSON (more coming)
* [kartothek](https://kartothek.readthedocs.io/en/latest/): managing dataset collections <sup>1</sup>
* TensorFlow dataset based on Arrow: `ArrowDataset` <sup>2</sup>
* Vaex: Arrow as an optional way to open DataFrames <sup>3</sup>
* ...

.credits[
<sup>1</sup> https://kartothek.readthedocs.io/en/latest/<br>
<sup>2</sup> https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f<br>
<sup>3</sup> https://docs.vaex.io/
]

???
Vaex: optional way to open datasets (although next to Parquet, they use the messaging protocol as storage -> memory map)
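---

# Reading CSV with Arrow

A small sketch of the built-in CSV reader mentioned above (the file name is a placeholder); it parses in parallel and returns an Arrow table with inferred types.

```python
import pyarrow.csv

table = pyarrow.csv.read_csv("dataset.csv")
df = table.to_pandas()   # hand the result to pandas if that is where you work
```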
---

# Arrow use cases

* Data access
* **Data movement**
  * Zero-copy interprocess communication
  * Efficient RPC / client-server communications
  * Pass data structures across language boundaries in-memory without copying (e.g. Java to C++)
* Runtime data structures for analytics

---

## PySpark user-defined functions

Arrow-accelerated Python + Apache Spark

* Joint work with Li Jin from Two Sigma, Bryan Cutler from IBM
* Vectorized user-defined functions, fast data export to pandas

```python
import pandas as pd
from scipy import stats
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def cdf(s):
    return pd.Series(stats.norm.cdf(s))

df.withColumn('cumulative_probability', cdf(df.v))
```

---

## PySpark user-defined functions

.credits[
Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
]

---

# Arrow Flight RPC Framework

* When shared memory is not an option
* Arrow-native RPC framework based on gRPC
* High performance communication
* Implement custom data services that send and receive Arrow columnar data natively

???
We are developing a new Arrow-native RPC framework, Arrow Flight, based on gRPC for high performance Arrow-based messaging. Through low-level extensions to gRPC's internal memory management, we are able to avoid expensive parsing when receiving datasets over the wire, unlocking unprecedented levels of performance in moving datasets from one machine to another.

A gRPC-based framework for defining custom data services that send and receive Arrow columnar data natively:

* Uses Protocol Buffers v3 for the client protocol
* Pluggable command execution layer, authentication
* Low-level gRPC optimizations
  * Write Arrow memory directly onto the outgoing gRPC buffer
  * Avoid any copying or deserialization
* TCP: over a gigabyte per second of data
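---

# Arrow IPC in a few lines

A minimal sketch (not from the original talk) of the Arrow IPC stream format that transports like Flight build on; the record batch contents and column names are made up for illustration.

```python
import pyarrow as pa

# A tiny record batch (contents made up for illustration)
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["ints", "strs"])

# Write it in the Arrow IPC stream format to an in-memory buffer ...
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
buf = sink.getvalue()

# ... and read it back; in a real setting the buffer could sit in shared memory
# or travel over the network (Flight sends these batches over gRPC).
table = pa.ipc.open_stream(buf).read_all()
print(table.num_rows, table.schema)
```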
---

# Other examples

* Plasma: in-memory, shared object store
* BigQuery Storage API <sup>1</sup>
* Dremio + Gandiva (execute LLVM-compiled expressions inside a Java-based query engine) <sup>2</sup>
* ...

.credits[
<sup>1</sup> https://cloud.google.com/bigquery/docs/reference/storage/<br>
<sup>2</sup> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
]
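---

# Plasma in a few lines

A rough sketch of using the Plasma shared-memory object store from Python; it assumes a store was started separately (e.g. `plasma_store -m 1000000000 -s /tmp/plasma`), the stored values are just examples, and API details may differ between pyarrow versions.

```python
import pyarrow as pa
import pyarrow.plasma as plasma

# Every process connects to the same store socket
client = plasma.connect("/tmp/plasma")

# One process puts an object into shared memory ...
object_id = client.put(pa.array([1, 2, 3]))

# ... and any other process on the machine can get it without copying it around.
same_array = client.get(object_id)
```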
---

# Arrow use cases

* Data access
* Data movement
* **Runtime data structures for analytics**
  * Efficient in-memory / out-of-core data frame-type analytics
  * LLVM compilation for vectorized expression evaluation
  * Examples:
    * Dremio: Data-as-a-Service Platform (https://www.dremio.com/)
    * RAPIDS GPU DataFrame, BlazingSQL

---

# Arrow on the GPU

RAPIDS framework: https://rapids.ai/

* cuDF: a Python GPU DataFrame library built on the Apache Arrow columnar memory format (as internal representation)

.center[
]

---

# Pandas based on Arrow?

.center[
]

???
What if pandas ...

* had consistent missing data handling
* had a proper string data type
* had native support for nested data, decimals, dates, ...
* didn't make a copy for every operation
* could memory-map parts of big tables
* had a parallel query engine

---

# Pandas based on Arrow?

The Fletcher package (by Uwe Korn) experiments with this:

* https://github.com/xhochy/fletcher
* Based on the pandas ExtensionArray interface
* Using Arrow to store the data in a pandas Series
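---

# Pandas based on Arrow? A small illustration

A hedged sketch (values are made up) of two of the points above, using plain pyarrow: integers with missing values stay integers, and strings are a real variable-length type.

```python
import pandas as pd
import pyarrow as pa

pd.Series([1, None, 3]).dtype      # float64: pandas upcasts the ints to fit NaN
pa.array([1, None, 3]).type        # int64, with a separate validity bitmap for the null
pa.array(["a", None, "b"]).type    # string: a native variable-length string type
```

Fletcher wraps such Arrow arrays in a pandas ExtensionArray so they can live inside a Series.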
---

.center[
]

* Cross-language development platform for in-memory, tabular data
* Open standard columnar memory format
* Efficient interprocess communication and shared computational tools
* Support for many languages (C, C++, Java, JavaScript, Python, R, MATLAB, Rust, ...)

???
A lot happens "behind the scenes"

---

# Arrow C++ Roadmap

.center[
]

---

# Getting involved

* Building a community in Apache Arrow
* Join dev@arrow.apache.org
* PRs to https://github.com/apache/arrow

---

.left-column[
.middle[
]

http://ursalabs.org/
]

.right-column[
.small[
* Raise money to support full-time open source developers
* Grow the Apache Arrow ecosystem
* Build cross-language, portable computational libraries for data science
* Build relationships across the industry
]
]

.reset-column[
Initial sponsors: RStudio, Two Sigma, NVIDIA
]

---
class: middle

https://arrow.apache.org/

# Thanks for listening!

## These slides: https://github.com/jorisvandenbossche/talks/

## Slides largely based on slides by Wes McKinney