layout: true

<div class="my-footer">
<span>
<!-- You could put a footer link in here, which I used to do for a lot of my other slide decks, but it probably doesn't work well with this theme -->
</span>
</div>

---
class: title-slide, theme-green-maximal

.center[.large[.boldface[Geospatial and Apache Arrow:
accelerating geospatial data exchange and compute
]]]

.center[.midi[Joris Van den Bossche and Dewey Dunnington]]

.center[.midi[Voltron Data]]

.center[https://jorisvandenbossche.github.io/talks/]

---
class: theme-green-minimal

## About Us
Joris Van den Bossche
Arrow Project Management Committee
Pandas maintainer
Geopandas maintainer
Dewey Dunnington
Arrow committer
r-spatial maintainer
Ph.D. Earth & Environmental Science
---
class: theme-green-minimal

## What is Apache Arrow?

> A specification defining a common, language-agnostic
> in-memory representation for columnar data
> \+
> A multi-language toolbox for accelerated data interchange
> and in-memory processing

.abs-layout.bottom-1.left-70.width-40[
![:scale 60%](img/arrow-logo_hex_white-txt_black-bg.png)
]

---
class: theme-green-minimal

## Accelerating data interchange
Image by Danielle Navarro
---
class: theme-green-minimal

## Accelerating data interchange
Image by Danielle Navarro
---
class: theme-green-minimal

## Efficient in-memory processing
Image by Danielle Navarro
---
class: theme-green-minimal

## Apache Arrow

- Fast data access
- Data interchange (over network, inter-process, in-process)
- Efficient runtime data structure for analytics
- Sharing implementations and computational tools

---
class: theme-green-minimal
count: false

## Apache Arrow + Geospatial

- Fast data access
- Data interchange (over network, inter-process, in-process)
- Efficient runtime data structure for analytics
- Sharing implementations and computational tools

➔ All relevant for geospatial data as well! (with a focus on tabular vector data)

---
class: theme-green-minimal
layout: true
name: parquet

## What is Apache Parquet?

From http://parquet.apache.org/:

.abs-layout.bottom-1.left-50.width-40[
![](img/Apache_Parquet_logo.svg.png)
]

---
layout: false
template: parquet
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

---
template: parquet
count: false

> *Apache Parquet is an **open source**, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is **available in multiple languages** including Java, C++, Python, etc...*

---
count: false

> *Apache Parquet is an open source, **column-oriented** data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle **complex data** in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides **efficient data compression and encoding schemes** with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

➔ Widely used file format to store large amounts of data (data lakes) for analytical processing, often in cloud context

---
layout: false
class: theme-green-minimal

## What is GeoParquet?

Goal:

> *Standardize how geospatial data is represented in Parquet to further geospatial interoperability among tools using Parquet today, and hopefully help push forward what's possible with 'cloud-native geospatial' workflows.*

--
count: false

➔ Specification of how to store geospatial vector data in Parquet files

--
count: false

* Which data type to use (currently WKB as variable-size binary (BYTE_ARRAY))

--
count: false

* Metadata (encoding, coordinate reference system, geometry types, planar vs spherical edges, ...)

???
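
The two ingredients named on this slide (WKB values in a binary column, plus file-level metadata) can be sketched with the Python standard library alone. This is a minimal illustration, not the reference implementation: the helper names `point_to_wkb` and `wkb_to_point` are hypothetical, and the metadata key names follow the GeoParquet 1.0 layout as an assumption (earlier drafts used slightly different keys, for example a singular `geometry_type`).

```python
import json
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: byte order, geometry type, x, y."""
    # "<" = little endian without padding, B = byte-order flag (1 = little endian),
    # I = uint32 geometry type (1 = Point), dd = two float64 coordinates.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Decode the 21-byte point produced by point_to_wkb back into (x, y)."""
    byte_order, geometry_type, x, y = struct.unpack("<BIdd", wkb)
    if byte_order != 1 or geometry_type != 1:
        raise ValueError("this sketch only handles little-endian WKB points")
    return x, y


# Hypothetical metadata for a file with one geometry column named "geometry".
# Key names are an assumption based on the GeoParquet 1.0 layout.
geo_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Point"],
            "edges": "planar",
        }
    },
}

# The metadata travels as a JSON string under the "geo" key of the
# Parquet file's key/value metadata.
file_metadata = {b"geo": json.dumps(geo_metadata).encode("utf-8")}

# Each value of the geometry column is a variable-size binary (BYTE_ARRAY) blob.
geometry_column = [point_to_wkb(174.78, -41.29)]

decoded = json.loads(file_metadata[b"geo"])
primary = decoded["primary_column"]
print(primary, decoded["columns"][primary]["encoding"], wkb_to_point(geometry_column[0]))
```

A reader that understands this metadata can find the geometry column and decode it; a reader that does not still sees a valid Parquet file with an ordinary binary column.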
(using the existing Parquet spec)

Features:

* Multiple spatial reference systems
* Multiple geometry columns
* Work with both planar and spherical coordinates
* Great compression / small files
* Great at read-heavy analytic workflows
* Support for data partitioning
* Enable spatial indices (planned)

---
class: theme-green-minimal

## GeoParquet: fast reading and writing

Python:

```python
import geopandas
geopandas.read_parquet("nz-buildings-outlines.parquet")
```

R:

```r
library(geoarrow)
read_geoparquet_sf("nz-buildings-outlines.parquet")
```

---
class: theme-green-minimal

## GeoParquet: fast reading and writing

```python
import geopandas
geopandas.read_parquet("nz-buildings-outlines.parquet")
```

![:scale 40%](img/bench_geoparquet.png)
![:scale 27%](img/bench_geoparquet_file_size.png)

.midi[Benchmark using GDAL master with pyogrio (GPKG, SHP, FGB) and pyarrow (Parquet) ]

---
class: theme-green-minimal

## GeoParquet: evolving specification

* GeoPandas has had Parquet IO since June 2020
* Official initiative started in OGC repo (2021)
* GeoParquet v0.1.0 (March 9, 2022): initial release
* GeoParquet v0.4.0 (May 26, 2022): use of PROJJSON, ...
* Currently active work towards v1.0.0 beta1 release

Early contributors include developers from GeoPandas, GeoTrellis, OpenLayers, Vis.gl, Planet, Voltron Data, Microsoft, Carto, Azavea & Unfolded

This spec is a work in progress! Feedback very welcome! Join at https://github.com/opengeospatial/geoparquet

---
class: theme-green-minimal

## GeoParquet: evolving specification

Expand support for the format:

* Included in GDAL 3.5
* Python (GeoPandas), R (sfarrow & geoarrow) and Julia (GeoParquet.jl)
* Online GeoJSON <-> GeoParquet converter based on Go: https://tschaub.net/gpq/
* WIP for Apache Sedona, Carto, ...
* Usage in data stores (Microsoft's Planetary Computer Data Catalog, ...)

Future work:

* Stabilize the spec
* Support for data partitioning and spatial indexing

???
Features:

* Multiple spatial reference systems
* Multiple geometry columns
* Work with both planar and spherical coordinates
* Great compression / small files
* Great at read-heavy analytic workflows
* Support for data partitioning
* Enable spatial indices (planned)

---
class: theme-green-minimal

## Data Interchange

![:scale 75%](img/gdal-data-transport-1.svg)

---
class: theme-green-minimal

## Data Interchange with Apache Arrow

![:scale 75%](img/gdal-data-transport-2.svg)
[RFC 86: Column-oriented read API for vector layers](https://github.com/OSGeo/gdal/pull/5830) by Even Rouault (coming to GDAL 3.6)

---
class: theme-green-minimal

## Data Interchange with Apache Arrow

[RFC 86: Column-oriented read API for vector layers](https://github.com/OSGeo/gdal/pull/5830) by Even Rouault

A proposal to output **Arrow C Data interface** structures from GDAL reduces the code required to read a data source to a few lines:

```cpp
// OGR C API (OGR_L_GetArrowStream) and C++ classes (GDALDataset, OGRLayer)
#include <ogr_api.h>
#include <ogrsf_frmts.h>

// GDALDataset* poDS = GDALDataset::Open("path/to/file.gpkg");
int read_ogr_stream(GDALDataset* poDS, struct ArrowArrayStream* stream) {
  // Read from the first layer of the data source
  OGRLayer* poLayer = poDS->GetLayer(0);
  OGRLayerH hLayer = OGRLayer::ToHandle(poLayer);
  // Fill `stream` with an Arrow C stream of record batches
  return OGR_L_GetArrowStream(hLayer, stream, nullptr);
}
```

(It's also 4-10x faster depending on the driver!)

---
class: theme-light-minimal
---
class: theme-light-minimal
---
class: theme-green-minimal

## Apache Arrow + Geospatial

Apache Parquet = file format

Apache Arrow = memory format (+ ...)

--
count: false

➔ GeoParquet -> how do we use Parquet to store geospatial data

GeoArrow -> how do we represent geospatial data in Arrow memory

---
class: theme-green-minimal

## GeoArrow: an Arrow-native storage format for vector geometries

- Arrow defines a rich set of types to encode arrays of pretty much anything...integers, doubles, strings, dates, times, nested lists, and more.
- It doesn't define an encoding for geometry!
- The GeoArrow specification is an attempt to formalize the encoding of geometry in an Arrow Array.
- Efficient format for direct computation

.center[
https://github.com/geoarrow/geoarrow
]

---
class: theme-green-minimal

## GeoArrow

GeoJSON "logical" representation:
```
{
  "type": "Polygon",
  "coordinates": [
    [[35.0, 10.0], [45.0, 45.0], [15.0, 40.0], [10.0, 20.0], [35.0, 10.0]],
    [[20.0, 30.0], [35.0, 35.0], [30.0, 20.0], [20.0, 30.0]]
  ]
}
```

GeoArrow physical representation for a "nested list":

.small[`List<List<FixedSizeList<double>>>`]

```
coordinates array:     [35.0, 10.0, 45.0, 45.0, 15.0, 40.0, 10.0, 20.0, 35.0, 10.0, 20.0, 30.0, 35.0, 35.0, 30.0, 20.0, 20.0, 30.0, ...]
ring offsets array:    [0, 5, 9, ...]
polygon offsets array: [0, 2, ...]
```

???

It's fast to iterate over geometry arranged in buffers designed for random access!

Geospatial functions in Acero will unlock smooth and performant workflows for all types of geospatial data

???

## Adoption and Applications

(logos for projects that already are using Parquet/Arrow/Geo in some form)

Listing some of the existing examples:

- GeoMesa
- Kyle’s JavaScript post
- cuSpatial GPU support
- Microsoft thinger

And other places it can be useful

---
class: theme-green-minimal

## GeoArrow

It's fast to iterate over geometry arranged in buffers designed for random access!

```
## # A tibble: 3 × 6
##   expression          min  median `itr/sec` mem_alloc `gc/sec`
## 1 length_geoarrow  18.9ms  19.1ms     52.1      3.7MB        0
## 2 length_geos      34.9ms    35ms     28.6     3.69MB        0
## 3 length_wkb      240.3ms 240.5ms      4.15    3.69MB        0
```

--
count: false

Experiments in C++ (https://github.com/geoarrow/geoarrow-cpp), Julia, Rust (https://github.com/geopolars/geopolars), ...
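
The nested-list layout from the earlier GeoArrow slide can be sketched in plain Python to show why iteration needs no parsing: every ring is a contiguous slice located by two offsets. This is an illustration under stated assumptions, not the GeoArrow implementation; the function names `polygons_to_buffers` and `ring_coordinates` are hypothetical, and Python lists stand in for Arrow buffers.

```python
def polygons_to_buffers(polygons):
    """Flatten polygons (lists of rings of [x, y] pairs) into three flat lists."""
    coords = []            # interleaved x, y values
    ring_offsets = [0]     # positions (counted in coordinate pairs) where rings start/end
    polygon_offsets = [0]  # positions (counted in rings) where polygons start/end
    for rings in polygons:
        for ring in rings:
            for x, y in ring:
                coords.extend((x, y))
            ring_offsets.append(len(coords) // 2)
        polygon_offsets.append(len(ring_offsets) - 1)
    return coords, ring_offsets, polygon_offsets


def ring_coordinates(coords, ring_offsets, i):
    """Return the (x, y) pairs of ring i by slicing, without scanning other rings."""
    start, end = ring_offsets[i], ring_offsets[i + 1]
    return [(coords[2 * j], coords[2 * j + 1]) for j in range(start, end)]


# The polygon from the GeoJSON example: one exterior ring and one hole.
polygon = [
    [[35.0, 10.0], [45.0, 45.0], [15.0, 40.0], [10.0, 20.0], [35.0, 10.0]],
    [[20.0, 30.0], [35.0, 35.0], [30.0, 20.0], [20.0, 30.0]],
]

coords, ring_offsets, polygon_offsets = polygons_to_buffers([polygon])
hole = ring_coordinates(coords, ring_offsets, 1)
print(ring_offsets, polygon_offsets, hole)
```

For the single example polygon this reproduces the ring offsets `[0, 5, 9]` and polygon offsets `[0, 2]` shown on the slide.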
Compatible format used in RAPIDS cuSpatial (https://github.com/rapidsai/cuspatial), datashader (https://github.com/holoviz/datashader/), ...

---
class: theme-green-minimal

## Want to know more?

These slides (with links): https://jorisvandenbossche.github.io/talks/2022_FOSS4GBE_geoarrow/

Long version of this presentation at The Data Thread: https://www.youtube.com/watch?v=PbO5FVcPUIQ

Blogposts:

* Building Bridges: Arrow, Parquet, and Geospatial Computing - Dewey Dunnington: https://dewey.dunnington.ca/post/2022/building-bridges-arrow-parquet-and-geospatial-computing/
* GeoArrow and GeoParquet in deck.gl - Kyle Barron: https://observablehq.com/@kylebarron/geoarrow-and-geoparquet-in-deck-gl

---
class: theme-green-maximal

## What next?

.pull-left[
- Ongoing specification development
- Continued support for GeoParquet
- Continued support for GeoArrow
]

.pull-right[
twitter.com/jorisvdbossche
twitter.com/paleolimbot
github.com/geoarrow/geoarrow
github.com/opengeospatial/geoparquet
]