class: center, middle

## Cloud-optimized vector formats:
a deep dive into GeoParquet

#### Joris Van den Bossche

Cloudscaping Geo Symposium, Delft, October 30, 2025

https://github.com/jorisvandenbossche/2025-cloudscaping-geoparquet-workshop

---
# About me

.larger[**Joris Van den Bossche**]
- Background: PhD bio-science engineer, air quality research
- Open source enthusiast: core developer of pandas, GeoPandas, Shapely, Apache Arrow (pyarrow), ...
- Currently working as a software engineer at Fused
.center[ .affiliations[   ] ]
.center[
fosstodon.org/@jorisvandenbossche
github.com/jorisvandenbossche
]

.abs-layout.top-10.left-70[  ]

---
class: inverse

# Fused (https://fused.io )

### Analytics at the Speed of Thought

* AI-powered data analytics platform with a full Python runtime
* Serverless Python at scale
* Use the Python stack you are familiar with
* Call from anywhere (Python, HTTP, ...)
* Parallel processing and caching
* Geospatial focus
* Ingest data into Cloud Native formats
* Integrate with maps and web apps

.abs-layout.top-10.left-80[  ]
.abs-layout.top-47.left-57[  ]

---
class: center, middle

# Cloud optimized: what's in a name?

---
# Cloud optimized: what's in a name?

> *The property of a file format to be able to read a meaningful part of the file without needing to download all of the file.*

--
count: false

In particular:

- In a cloud context, data can be read efficiently through HTTP range requests
- All metadata describing chunks can be loaded by a single read operation
- This allows for parallelized and partial reading

--
count: false

But there is no one-size-fits-all approach! The best optimization depends on the use case.

.bottom[.small-x[ From https://guide.cloudnativegeo.org/glossary.html#cloud-optimized ]]

---
class: center

Cloud-Optimized Geospatial Formats (credit: https://guide.cloudnativegeo.org/)

---
class: center, middle

# Apache Parquet

---
layout: true
name: parquet

## What is Apache Parquet?

From http://parquet.apache.org/:

.abs-layout.bottom-1.left-50.width-40[  ]

---
layout: false
template: parquet
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.*

---
template: parquet
count: false

> *Apache Parquet is an **open source**, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is **supported in many programming language and analytics tools**.*

---
count: false

> *Apache Parquet is an open source, **column-oriented** data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle **complex data** in bulk and is supported in many programming language and analytics tools.*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides **high performance compression and encoding schemes** to handle complex data in bulk and is supported in many programming language and analytics tools.*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.*

➔ Widely used file format to store large amounts of data (data lakes) for **analytical processing**, often in a cloud context

---
layout: false
class: center, middle

# Geospatial data in Parquet
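---
## Geometries in plain Parquet?

A minimal sketch (not from the original deck) of the naive approach, assuming shapely and pyarrow are installed: geometries are serialized to Well-Known Binary (WKB) and written as an ordinary binary column. The column names and `points.parquet` file are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from shapely import Point, to_wkb

# Two toy point geometries (hypothetical data)
geoms = [Point(0, 0), Point(1, 1)]

# Serialize each geometry to WKB bytes and store them in a plain binary column
table = pa.table({
    "name": ["a", "b"],
    "geometry": [to_wkb(geom) for geom in geoms],
})
pq.write_table(table, "points.parquet")
```

This works, but nothing in the file tells a reader that the binary column contains geometries, which coordinate reference system they use, or whether edges are planar or spherical. That missing metadata is exactly what GeoParquet standardizes (next slides).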
---
## What is GeoParquet?

Goal:

> *Standardize how geospatial data is represented in Parquet to further geospatial interoperability among tools and cloud data warehouses using Parquet today, and introduce a columnar data format to the geospatial world*

--
count: false

➔ Specification of how to store geospatial vector data in Parquet files

--
count: false

* Which data type to use (typically Well-Known Binary (WKB) as a binary column)

--
count: false

* Metadata (encoding, coordinate reference system, geometry types, planar vs spherical edges, ...) — see the appendix slide at the end for a sketch of reading this metadata

.center[
https://github.com/opengeospatial/geoparquet | https://geoparquet.org/
]

???
(using the existing Parquet spec)

Features:

* Multiple spatial reference systems
* Multiple geometry columns
* Work with both planar and spherical coordinates
* Great compression / small files
* Great at read-heavy analytic workflows
* Support for data partitioning
* Enable spatial indices (planned)

---
class: center

Schematic of Parquet file layout (credit: https://guide.cloudnativegeo.org/)

---
## GeoParquet: fast reading and writing

Python:

```python
import geopandas

geopandas.read_parquet("nz-buildings-outlines.parquet")
```

R:

```r
library(geoarrow)

read_geoparquet_sf("nz-buildings-outlines.parquet")
```

---
## GeoParquet: fast reading and writing

```python
import geopandas

geopandas.read_parquet("nz-buildings-outlines.parquet")
```

.midi[Benchmark using GDAL with pyogrio (GPKG, SHP, FGB) and pyarrow (Parquet)]

---
## GeoParquet history and status

* History overview:
  * GeoPandas has Parquet IO since June 2020
  * Official initiative started in OGC repo (2021)
  * GeoParquet v0.1.0 (March 2022): initial release
  * GeoParquet v1.0.0 (September 2023): first stable release
  * GeoParquet v1.1.0 (June 2024): bbox column, etc.
* Incubating Open Geospatial Consortium (OGC) standard
* Future: native support in Apache Parquet itself
* Supported by many tools and libraries, such as GDAL, QGIS, GeoPandas, R sf, DuckDB, BigQuery, Apache Sedona, CARTO, etc.

---
class: center, middle

# Demo time!

➔ https://github.com/jorisvandenbossche/2025-cloudscaping-geoparquet-workshop

---
## GeoParquet vs Parquet with "native" types

https://cloudnativegeo.org/blog/2025/02/geoparquet-2.0-going-native/

.center[  ]

---
## Thanks for your attention!

Questions?

Slides and code: https://github.com/jorisvandenbossche/2025-cloudscaping-geoparquet-workshop

Also come talk to me about:

- **pandas/geopandas**: Python DataFrame library and geospatial extension
- **shapely**: Python package providing vector geometry operations (wrapping GEOS C++)
- **pyogrio**: Python bindings to GDAL for bulk (columnar) IO
- **Apache Arrow**: universal columnar format (and **pyarrow**, a Python implementation)
- **GeoArrow**: specifying how to store geospatial data in Apache Arrow

.center[
fosstodon.org/@jorisvandenbossche
github.com/jorisvandenbossche
]
.center[  ]
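---
## Appendix: inspecting GeoParquet metadata

A minimal sketch (not part of the original slides), assuming pyarrow is installed. GeoParquet stores its file-level metadata as a JSON string under the "geo" key of the Parquet key/value metadata; the file name reuses the example from the benchmark slide.

```python
import json
import pyarrow.parquet as pq

# Read only the schema/footer, not the data
schema = pq.read_schema("nz-buildings-outlines.parquet")

# Decode the GeoParquet metadata stored under the "geo" key
geo_meta = json.loads(schema.metadata[b"geo"])
print(geo_meta["primary_column"])
print(geo_meta["columns"]["geometry"]["encoding"])  # e.g. "WKB"
```

For a file written with the GeoParquet 1.1 bbox covering column, a recent GeoPandas (>= 1.0) can use that column to skip row groups outside an area of interest; the extent below is a hypothetical one.

```python
import geopandas

# Only the parts of the file intersecting the bbox are read
gdf = geopandas.read_parquet(
    "nz-buildings-outlines.parquet",
    bbox=(174.7, -36.9, 174.8, -36.8),  # (xmin, ymin, xmax, ymax)
)
```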