class: center, middle

## Cloud-optimized vector formats:
a deep dive into GeoParquet

#### Joris Van den Bossche

Cloudscaping Geo Symposium, Delft, October 30, 2025

https://github.com/jorisvandenbossche/2025-cloudscaping-geoparquet-workshop

---
# About me

.larger[**Joris Van den Bossche**]
- Background: PhD bio-science engineer, air quality research
- Open source enthusiast: core developer of pandas, GeoPandas, Shapely, Apache Arrow (pyarrow), ...
- Currently working as a software engineer at Fused
.center[ .affiliations[   ] ]
.center[
fosstodon.org/@jorisvandenbossche
github.com/jorisvandenbossche
]

.abs-layout.top-10.left-70[  ]

---
class: inverse

# Fused (https://fused.io )

### Analytics at the Speed of Thought

* AI-powered data analytics platform with a full Python runtime
* Serverless Python at scale
* Use the Python stack you are familiar with
* Call from anywhere (Python, HTTP, ...)
* Parallel processing and caching
* Geospatial focus
* Ingest data into Cloud Native formats
* Integrate with maps and web apps

.abs-layout.top-10.left-80[  ]
.abs-layout.top-47.left-57[  ]

---
class: center, middle

# Cloud optimized: what's in a name?

---
# Cloud optimized: what's in a name?

> *The property of a file format to be able to read a meaningful part of the file without needing to download all of the file.*

--
count: false

In particular:

- In a cloud context, data can be read efficiently through HTTP range requests
- All metadata describing chunks can be loaded by a single read operation
- This allows for parallelized and partial reading

--
count: false

But there is no one-size-fits-all approach! The best optimization depends on the use case.

.bottom[.small-x[ From https://guide.cloudnativegeo.org/glossary.html#cloud-optimized ]]

---
class: center

Cloud-Optimized Geospatial Formats (credit: https://guide.cloudnativegeo.org/)

---
class: center, middle

# Apache Parquet

---
layout: true
name: parquet

## What is Apache Parquet?

From http://parquet.apache.org/:

.abs-layout.bottom-1.left-50.width-40[  ]

---
layout: false
template: parquet
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.*

---
template: parquet
count: false

> *Apache Parquet is an **open source**, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is **supported in many programming language and analytics tools**.*

---
count: false

> *Apache Parquet is an open source, **column-oriented** data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle **complex data** in bulk and is supported in many programming language and analytics tools.*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides **high performance compression and encoding schemes** to handle complex data in bulk and is supported in many programming language and analytics tools.*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming language and analytics tools.*

➔ Widely used file format to store large amounts of data (data lakes) for **analytical processing**, often in a cloud context

---
layout: false
class: center, middle

# Geospatial data in Parquet
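---
## Geometries in plain Parquet?

A minimal sketch (not from the original deck) of the naive approach, assuming shapely and pyarrow are installed: geometries are serialized to Well-Known Binary (WKB) and written as an ordinary binary column. The column names and `points.parquet` file are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from shapely import Point, to_wkb

# Two toy point geometries (hypothetical data)
geoms = [Point(0, 0), Point(1, 1)]

# Serialize each geometry to WKB bytes and store them in a plain binary column
table = pa.table({
    "name": ["a", "b"],
    "geometry": [to_wkb(geom) for geom in geoms],
})
pq.write_table(table, "points.parquet")
```

This works, but nothing in the file tells a reader that the binary column contains geometries, which coordinate reference system they use, or whether edges are planar or spherical. That missing metadata is exactly what GeoParquet standardizes (next slides).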
---
## What is GeoParquet?

Goal:

> *Standardize how geospatial data is represented in Parquet to further geospatial interoperability among tools and cloud data warehouses using Parquet today, and introduce a columnar data format to the geospatial world*

--
count: false

➔ Specification of how to store geospatial vector data in Parquet files

--
count: false

* Which data type to use (typically Well-Known Binary (WKB) as a binary column)

--
count: false

* Metadata (encoding, coordinate reference system, geometry types, planar vs spherical edges, ...) — see the appendix slide at the end for a sketch of reading this metadata

.center[
https://github.com/opengeospatial/geoparquet | https://geoparquet.org/
]

???
(using the existing Parquet spec)

Features:

* Multiple spatial reference systems
* Multiple geometry columns
* Work with both planar and spherical coordinates
* Great compression / small files
* Great at read-heavy analytic workflows
* Support for data partitioning
* Enable spatial indices (planned)

---
class: center

Schematic of Parquet file layout (credit: https://guide.cloudnativegeo.org/)

---
## GeoParquet: fast reading and writing

Python:

```python
import geopandas

geopandas.read_parquet("nz-buildings-outlines.parquet")
```

R:

```r
library(geoarrow)

read_geoparquet_sf("nz-buildings-outlines.parquet")
```

---
## GeoParquet: fast reading and writing

```python
import geopandas

geopandas.read_parquet("nz-buildings-outlines.parquet")
```

.midi[Benchmark using GDAL with pyogrio (GPKG, SHP, FGB) and pyarrow (Parquet)]

---
## GeoParquet history and status

* History overview:
  * GeoPandas has Parquet IO since June 2020
  * Official initiative started in OGC repo (2021)
  * GeoParquet v0.1.0 (March 2022): initial release
  * GeoParquet v1.0.0 (September 2023): first stable release
  * GeoParquet v1.1.0 (June 2024): bbox column, etc.
* Incubating Open Geospatial Consortium (OGC) standard
* Future: native support in Apache Parquet itself
* Supported by many tools and libraries, such as GDAL, QGIS, GeoPandas, R sf, DuckDB, BigQuery, Apache Sedona, CARTO, etc.

---
class: center, middle

# Demo time!

➔ https://github.com/jorisvandenbossche/2025-cloudscaping-geoparquet-workshop

---
## GeoParquet vs Parquet with "native" types

https://cloudnativegeo.org/blog/2025/02/geoparquet-2.0-going-native/

.center[  ]

---
## Thanks for your attention!

Questions?

Slides and code: https://github.com/jorisvandenbossche/2025-cloudscaping-geoparquet-workshop

Also come talk to me about:

- **pandas/geopandas**: Python DataFrame library and geospatial extension
- **shapely**: Python package providing vector geometry operations (wrapping GEOS C++)
- **pyogrio**: Python bindings to GDAL for bulk (columnar) IO
- **Apache Arrow**: universal columnar format (and **pyarrow**, a Python implementation)
- **GeoArrow**: specifying how to store geospatial data in Apache Arrow

.center[
fosstodon.org/@jorisvandenbossche
github.com/jorisvandenbossche
]
.center[  ]
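---
## Appendix: inspecting GeoParquet metadata

A minimal sketch (not part of the original slides), assuming pyarrow is installed. GeoParquet stores its file-level metadata as a JSON string under the "geo" key of the Parquet key/value metadata; the file name reuses the example from the benchmark slide.

```python
import json
import pyarrow.parquet as pq

# Read only the schema/footer, not the data
schema = pq.read_schema("nz-buildings-outlines.parquet")

# Decode the GeoParquet metadata stored under the "geo" key
geo_meta = json.loads(schema.metadata[b"geo"])
print(geo_meta["primary_column"])
print(geo_meta["columns"]["geometry"]["encoding"])  # e.g. "WKB"
```

For a file written with the GeoParquet 1.1 bbox covering column, a recent GeoPandas (>= 1.0) can use that column to skip row groups outside an area of interest; the extent below is a hypothetical one.

```python
import geopandas

# Only the parts of the file intersecting the bbox are read
gdf = geopandas.read_parquet(
    "nz-buildings-outlines.parquet",
    bbox=(174.7, -36.9, 174.8, -36.8),  # (xmin, ymin, xmax, ymax)
)
```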