layout: true

<div class="my-footer">
<span>
<!-- You could put a footer link in here, which I used to do for a lot of my other slide decks, but it probably doesn't work well with this theme -->
</span>
</div>

---
class: title-slide, theme-green-maximal

.center[.large[.boldface[Geospatial and Apache Arrow:
accelerating geospatial data exchange and compute
]]]

.center[.midi[Joris Van den Bossche and Dewey Dunnington]]

.center[.midi[Voltron Data]]

---
class: theme-green-minimal

## About Us
Joris Van den Bossche
Arrow Project Management Committee
pandas maintainer
GeoPandas maintainer
Dewey Dunnington
Arrow committer
r-spatial maintainer
Ph.D. Earth & Environmental Science
---
class: theme-green-minimal

## What is Apache Arrow?

> A specification defining a common, language-agnostic
> in-memory representation for columnar data
> \+
> A multi-language toolbox for accelerated data interchange
> and in-memory processing

.abs-layout.bottom-1.left-70.width-40[
![:scale 60%](img/arrow-logo_hex_white-txt_black-bg.png)
]

---
class: theme-green-minimal

## Accelerating data interchange
Image by Danielle Navarro
---
class: theme-green-minimal

## Accelerating data interchange
Image by Danielle Navarro
---
class: theme-green-minimal

## Efficient in-memory processing
Image by Danielle Navarro
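---
class: theme-green-minimal

## Columnar layout: a rough sketch

To make "columnar in-memory representation" a bit more concrete, here is a minimal, standard-library-only sketch of a variable-length string column stored as one contiguous data buffer plus an offsets buffer, similar in spirit to how Arrow lays out string arrays. The helper names `build_string_column` and `get_value` are hypothetical and not part of any Arrow library; validity bitmaps and buffer padding are left out.

```python
from array import array


def build_string_column(values):
    """Pack strings into one contiguous data buffer plus an offsets buffer."""
    data = bytearray()
    # Assumption: typecode "i" is 32 bits wide, as on common platforms.
    offsets = array("i", [0])
    for value in values:
        data.extend(value.encode("utf-8"))
        offsets.append(len(data))  # end of this value = start of the next
    return bytes(data), offsets


def get_value(data, offsets, i):
    """Random access: element i spans data[offsets[i]:offsets[i + 1]]."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets = build_string_column(["Arrow", "", "Parquet"])
```

Any value can be located with two offset lookups and no parsing, which is the property that makes this kind of layout cheap to share and scan.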
---
class: theme-green-minimal
layout: true
name: parquet

## What is Apache Parquet?

From http://parquet.apache.org/:

.abs-layout.bottom-1.left-50.width-40[
![](img/Apache_Parquet_logo.svg.png)
]

---
layout: false
template: parquet
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

➔ Widely used file format to store large amounts of data (data lakes) for analytical processing, often in a cloud context

---
layout: false
class: theme-green-minimal

## Apache Arrow

- Fast data access
- Data interchange (over network, inter-process, in-process)
- Efficient runtime data structure for analytics
- Sharing implementations and computational tools

---
class: theme-green-minimal
count: false

## Apache Arrow + Geospatial

- Fast data access
- Data interchange (over network, inter-process, in-process)
- Efficient runtime data structure for analytics
- Sharing implementations and computational tools

➔ All relevant for geospatial data as well! (with a focus on tabular vector data)

---
class: theme-green-minimal

## Data Interchange

![:scale 75%](img/gdal-data-transport-1.svg)

---
class: theme-green-minimal

## Data Interchange with Apache Arrow

![:scale 75%](img/gdal-data-transport-2.svg)

[RFC 86: Column-oriented read API for vector layers](https://github.com/OSGeo/gdal/pull/5830) by Even Rouault (coming to GDAL 3.6)

---
class: theme-light-minimal

---
class: theme-light-minimal

---
class: theme-green-minimal

## Apache Arrow + Geospatial

Apache Parquet = file format

Apache Arrow = memory format (+ ...)

--
count: false

➔ GeoParquet -> how do we use Parquet to store geospatial data

GeoArrow -> how do we represent geospatial data in Arrow memory

---
layout: false
class: theme-green-minimal

## What is GeoParquet?

Goal:

> *Standardize how geospatial data is represented in Parquet to further geospatial interoperability among tools using Parquet today, and hopefully help push forward what's possible with 'cloud-native geospatial' workflows.*

➔ A specification for how to store geospatial vector data in Parquet files

* Which data type to use (currently WKB as variable-size binary (BYTE_ARRAY))
* Metadata (encoding, coordinate reference system, geometry types, planar vs spherical edges, ...)

---
class: theme-green-minimal

## GeoArrow: an Arrow-native storage format for vector geometries

- Arrow defines a rich set of types to encode arrays of pretty much anything: integers, doubles, strings, dates, times, nested lists, and more.
- It doesn't define an encoding for geometry!
- The GeoArrow specification is an attempt to formalize the encoding of geometry in an Arrow Array.
- Efficient format for direct computation

.center[
https://github.com/geopandas/geo-arrow-spec
]

---
class: theme-green-minimal

# GeoArrow

GeoJSON "logical" representation:

```
{
    "type": "Polygon",
    "coordinates": [
        [[35.0, 10.0], [45.0, 45.0], [15.0, 40.0], [10.0, 20.0], [35.0, 10.0]],
        [[20.0, 30.0], [35.0, 35.0], [30.0, 20.0], [20.0, 30.0]]
    ]
}
```

GeoArrow physical representation for a "nested list":

```
coordinates array: [35.0, 10.0, 45.0, 45.0, 15.0, 40.0, 10.0, 20.0, 35.0, 10.0, 20.0, 30.0, 35.0, 35.0, 30.0, 20.0, 20.0, 30.0, ...]
ring offsets array: [0, 5, 9, ...]
polygon offsets array: [0, 2, ...]
```

???

It's fast to iterate over geometry arranged in buffers designed for random access!

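---
class: theme-green-minimal

## GeoArrow: building the nested list by hand

A rough sketch of how the coordinates, ring offsets, and polygon offsets arrays from the previous slide could be derived from GeoJSON-style coordinates. It uses plain Python lists rather than real Arrow buffers, and the function name `polygons_to_nested_list` is hypothetical, not part of the GeoArrow specification or any library.

```python
def polygons_to_nested_list(polygons):
    """Flatten GeoJSON-style polygon coordinates into three flat arrays."""
    coords = []            # interleaved x, y values
    ring_offsets = [0]     # boundaries of each ring, counted in points
    polygon_offsets = [0]  # boundaries of each polygon, counted in rings
    n_points = 0
    for rings in polygons:
        for ring in rings:
            for x, y in ring:
                coords.extend((x, y))
            n_points += len(ring)
            ring_offsets.append(n_points)
        polygon_offsets.append(len(ring_offsets) - 1)
    return coords, ring_offsets, polygon_offsets


polygon = [
    [[35.0, 10.0], [45.0, 45.0], [15.0, 40.0], [10.0, 20.0], [35.0, 10.0]],
    [[20.0, 30.0], [35.0, 35.0], [30.0, 20.0], [20.0, 30.0]],
]
coords, ring_offsets, polygon_offsets = polygons_to_nested_list([polygon])
```

Ring `k` can then be sliced out directly as `coords[2 * ring_offsets[k]:2 * ring_offsets[k + 1]]`, without parsing anything.

???
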
Geospatial functions in Acero will unlock smooth and performant workflows for all types of geospatial data

???

## Adoption and Applications

(logos for projects that are already using Parquet/Arrow/Geo in some form)

Listing some of the existing examples:

- GeoMesa
- Kyle’s JavaScript post
- cuSpatial GPU support
- Microsoft thinger

And other places it can be useful

---
class: theme-green-minimal

## Want to know more?

These slides (with links): https://jorisvandenbossche.github.io/talks/2022_FOSS4G_geoarrow/

Long version of this presentation at The Data Thread: https://www.youtube.com/watch?v=PbO5FVcPUIQ

Blog posts:

* Building Bridges: Arrow, Parquet, and Geospatial Computing - Dewey Dunnington: https://dewey.dunnington.ca/post/2022/building-bridges-arrow-parquet-and-geospatial-computing/
* GeoArrow and GeoParquet in deck.gl - Kyle Barron: https://observablehq.com/@kylebarron/geoarrow-and-geoparquet-in-deck-gl

---
class: theme-green-maximal

## What next?

.pull-left[
- Ongoing specification development
- Continued support for GeoParquet
- Continued support for GeoArrow
]

.pull-right[
twitter.com/jorisvdbossche
twitter.com/paleolimbot
github.com/geopandas/geo-arrow-spec
github.com/opengeospatial/geoparquet
]
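
---
class: theme-green-minimal

## Backup: what WKB looks like

The GeoParquet slide notes that geometries are currently stored as WKB in a BYTE_ARRAY column. As a minimal illustration of that encoding, the sketch below packs and unpacks a single 2D point using only the standard library. The helper names `point_to_wkb` and `wkb_to_point` are hypothetical, and only little-endian 2D points are handled.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: byte order, geometry type, x, y."""
    # "<" = little-endian without padding, B = byte order flag (1),
    # I = geometry type (1 = Point), d d = the two coordinates.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(buf):
    """Decode a little-endian 2D WKB point back to an (x, y) tuple."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


wkb = point_to_wkb(35.0, 10.0)
```

Each value carries its own header and has to be decoded before use, which is the contrast with the offset-based GeoArrow layout shown earlier.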