class: center, middle

![:scale 50%](img/geopandas_logo.svg)

# State of GeoPandas ecosystem

Joris Van den Bossche & Martin Fleischmann, GeoPython, June 21, 2022

https://github.com/jorisvandenbossche/talks/

---

# About us

**Joris Van den Bossche**

- Background: PhD bio-science engineer, air quality research
- pandas core dev, geopandas maintainer, scikit-learn contributor
- Currently working part-time at Voltron Data on Apache Arrow

**Martin Fleischmann**

* Researcher at Geographic Data Science Lab at the University of Liverpool
* Urban morphology and geographic data science
* Author of momepy package, member of development teams of GeoPandas and PySAL

.center[
twitter.com/jorisvdbossche
twitter.com/martinfleis
]

???

https://github.com/jorisvandenbossche

Twitter: [@jorisvdbossche](https://twitter.com/jorisvdbossche)

.affiliations[
![:scale 30%](img/apache-arrow.png)
![:scale 64%](img/voltrondata-logo-green.png)
]

---
class: middle, center

# GeoPandas

# Easy, fast and scalable geospatial analysis in Python

---

# GeoPandas

Make working with geospatial data in Python easier

* Started by Kelsey Jordahl in 2013
* Extends the pandas data analysis library to work with geographic objects and spatial operations
* Combines the power of the whole ecosystem of (geo) tools (pandas, geos, shapely, gdal, fiona, pyproj, rtree, ...)

Documentation: http://geopandas.readthedocs.io/

???

make working with geospatial data like working with any other kind of data in python (data stack, numpy, pandas and other tools around those)

analysis for which you otherwise would need desktop GIS applications (QGIS, ArcGIS) or geospatial databases (PostGIS)

makes pandas objects geometry aware

properties

shapefile: shapes / attributes => fits dataframe

=> geopandas => made easy:

- familiar for pandas users
- provides easy access to geometrical operations for arrays of geometries (no loops anymore)
- joins, overlays, mapping

GeoPandas showcase (filtering, plotting, join, projection, ...)

---

# GeoPandas

* Read and write a variety of formats (fiona, GDAL/OGR)
* Familiar manipulation of the attributes (pandas dataframe)
* Element-wise spatial predicates (intersects, within, ...) and operations (intersection, union, difference, ...) (shapely)
* Re-project your data (pyproj)
* Quickly visualize the geometries (matplotlib, folium)
* More advanced spatial operations: spatial joins and overlays

**➔ Interactive exploration and analysis of geospatial data**

---
class: middle, center

# GeoPandas

# .darkred[.fat[Easy]], fast and scalable geospatial analysis in Python

## ➔ Some highlights of new features of the last releases

---

## New interactive `explore()` using Folium / leaflet.js

```python
gdf.explore()
```

.center[
]

---

## New interactive `explore()` using Folium / leaflet.js

```python
gdf.explore(column="gdp_md_est", scheme="NaturalBreaks")
```

.center[
]

---

## Joining based on proximity: `sjoin_nearest()`

`sjoin()` requires an exact predicate (within, contains, intersects, ...).

New `sjoin_nearest()` method to join based on proximity, with the ability to set a maximum search radius.

--
count: false

Example: joining nearest road attributes to points of interest:

```python
geopandas.sjoin_nearest(pois, roads, max_distance=50, distance_col="distance")
```

---

## Some other recent improvements

* New `GeoDataFrame.to_postgis()` method to write to a PostGIS database
* `geopandas.read_file()` will now automatically recognize zip files, without needing to prepend "zip://" to the path
* Better support for GeoDataFrames with multiple geometry columns
* ... and many bug fixes and small improvements!

---

## `xyzservices`: Source of XYZ tiles providers

Lightweight library providing a collection of available XYZ services offering raster basemap tiles

```python
>>> import xyzservices.providers as xyz

>>> xyz.OpenStreetMap.Mapnik
{'url': 'https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
 'max_zoom': 19,
 'html_attribution': '© OpenStreetMap contributors',
 'attribution': '(C) OpenStreetMap contributors',
 'name': 'OpenStreetMap.Mapnik'}

>>> xyz.Stamen.TonerLite.url
'https://stamen-tiles-{s}.a.ssl.fastly.net/{variant}/{z}/{x}/{y}{r}.{ext}'
```

Used by geopandas, contextily, ipyleaflet, leafmap, bokeh, holoviews, ...

---
class: center, middle

![:scale 80%](img/overview-libraries.png)

???

What I showed are some features implemented in geopandas itself (although of course still depending on external dependencies), but geopandas also depends on some other packages to provide the core geospatial capabilities, and next we will look at improvements in those dependencies

---
class: middle, center

# GeoPandas

# Easy, .darkred[.fat[fast]] and scalable geospatial analysis in Python

## ➔ Improvements in the underlying geospatial Python stack

---
class: center, middle

![:scale 80%](img/overview-libraries-1.png)

---
class: center, middle

# Shapely 2.0 / PyGEOS

---

# Shapely

Python package for the manipulation and analysis of geometric objects

Pythonic interface to GEOS

--
count: false

.mmedium[
```python
>>> from shapely.geometry import Point, LineString, Polygon
>>> point = Point(1, 1)
>>> line = LineString([(0, 0), (1, 2), (2, 2)])
>>> poly = line.buffer(1)
```
]
.mmedium[
```python
>>> poly.contains(point)
True
```
]

--
count: false

Nice interface to GEOS, but: single objects, no attributes

---

# Why is GeoPandas slow?

- GeoPandas stores custom Python (Shapely) objects in arrays
- For operations, it iterates through those objects
- The Shapely objects each call the GEOS C operation

.center[
![:scale 60%](img/geopandas-shapely-1.svg)
]

---

# Why is GeoPandas slow?

- GeoPandas stores custom Python (Shapely) objects in arrays
- For operations, it iterates through those objects
- The Shapely objects each call the GEOS C operation

```python
class GeoSeries:
    ...
    def distance(self, other):
        result = [geom.distance(other) for geom in self.geometry]
        return pd.Series(result)
```

---

# Making it faster

- Move the loop into C and iterate directly over pointers to GEOS objects

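A minimal sketch of what moving the loop into C looks like from the Python side, assuming Shapely >= 2.0 and NumPy are installed. The point count, the seed and the variable names are illustrative only:

```python
import numpy as np
import shapely
from shapely.geometry import Point

# Assumption: Shapely >= 2.0, which exposes vectorized top-level functions.
poly = Point(0, 0).buffer(1)  # roughly a unit circle

# 10,000 random points in the square [-2, 2] x [-2, 2] (illustrative data).
rng = np.random.default_rng(42)
points = shapely.points(rng.uniform(-2, 2, size=(10_000, 2)))

# Python-level loop: one Python method call per geometry.
looped = np.array([poly.contains(p) for p in points], dtype=bool)

# Vectorized call: the loop over the geometries runs in C.
vectorized = shapely.contains(poly, points)
```

Both produce the same boolean array; only the place where the loop runs differs.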
.center[
![:scale 60%](img/geopandas-shapely-2.svg)
]

---

# Shapely 2.0

Prototyped in PyGEOS, started by Casper van der Wel (https://github.com/pygeos/pygeos/), merged into Shapely to become Shapely 2.0

New way to expose geospatial operations from GEOS into Python:

- array-based + vectorized functions
- fast

---

# Array-based

Instead of a manual `for` loop:

```python
[poly.contains(point) for point in points]
```

you can do

```python
shapely.contains(poly, points)
```

---

# Fast

Benchmark for 1M points: contained in or distance to a polygon

![:scale 49%](img/pygeos_timings-contains.png) ![:scale 49%](img/pygeos_timings-distance.png)

Significant performance increase: 80x (contains) to 5x (distance) for this example

---

# Fast

Spatial join of NYC neighborhoods with census blocks (example from https://postgis.net/workshops/postgis-intro/)

```python
geopandas.sjoin(
    nyc_neighborhoods, nyc_census_blocks, predicate='intersects'
)
```

.center[
![:scale 40%](img/geopandas-timings_sjoin_pygeos.png)
]

???

But depends a lot on the exact case (characteristics of the geometries being joined). Another artificial case of point-in-polygon with a grid, get 40x speed-up.

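A sketch of what such an artificial point-in-polygon case with a grid could look like. The grid size, the number of points and the `cell_id` column are made up for illustration, and `predicate=` assumes GeoPandas >= 0.10:

```python
import numpy as np
import geopandas
from shapely.geometry import box

# Hypothetical setup: an n x n grid of unit squares, each with an id.
n = 20
cells = geopandas.GeoDataFrame(
    {"cell_id": np.arange(n * n)},
    geometry=[box(i, j, i + 1, j + 1) for i in range(n) for j in range(n)],
)

# Random points, each placed strictly inside one randomly chosen cell
# (offsets stay away from the cell edges to avoid boundary cases).
rng = np.random.default_rng(0)
n_points = 5_000
ix = rng.integers(0, n, size=n_points)
iy = rng.integers(0, n, size=n_points)
x = ix + rng.uniform(0.1, 0.9, size=n_points)
y = iy + rng.uniform(0.1, 0.9, size=n_points)
points = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(x, y))

# Point-in-polygon join: attach the containing cell to every point.
joined = geopandas.sjoin(points, cells, predicate="within")

# Number of points per grid cell.
counts = joined.groupby("cell_id").size()
```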
---

# Roadmap for Shapely 2.0

See full proposal at https://github.com/shapely/shapely-rfc/pull/1/

Current status:

* Code of PyGEOS has been fully integrated into Shapely (including array-based API)
* Internals of Shapely are refactored to use C extension type instead of `ctypes` for the Geometry class
* Preserve familiar Shapely interface for single geometries
* Numpy is now a required dependency
* Set of API changes

---

# Roadmap for Shapely 2.0

See full proposal at https://github.com/shapely/shapely-rfc/pull/1/

* Set of API changes:
    * Make the Geometry objects immutable + hashable
    * Remove "array interface" (`np.array(line)` -> `np.array(line.coords)`)
    * Multi-part geometries are no longer iterable (`list(multi_polygon)` -> `list(multi_polygon.geoms)`)

---

# Roadmap for Shapely 2.0

Current status:

* Shapely 1.8 has been released with deprecation warnings
* We are planning a first Shapely 2.0 alpha release in July

--
count: false

How can you help?

* GeoPandas can already optionally use PyGEOS (will be updated to work with Shapely 2.0)
    * https://geopandas.readthedocs.io/en/latest/getting_started/install.html#using-the-optional-pygeos-dependency
* Have code or a package that relies on Shapely? Test with Shapely 1.8 and update your code
* Test with Shapely 2.0 once the first alpha is out

This includes a large set of changes. Testing and giving feedback is very useful!

???

* Try out GeoPandas with PyGEOS and give feedback

---
class: center, middle

![:scale 80%](img/overview-libraries-2.png)

---
class: center, middle

# pyogrio

## Vectorized spatial vector file format I/O using GDAL/OGR

---

# pyogrio

GeoPandas-oriented API to read/write GDAL/OGR vector data sources.

Faster than Fiona, but less general purpose.

Started by Brendan Ward, adapted from Fiona.

Has Windows wheels!

https://github.com/geopandas/pyogrio

---

# pyogrio

Direct use of pyogrio API:

```python
import pyogrio

gdf = pyogrio.read_dataframe("path/to/file")
pyogrio.write_dataframe(gdf, "path/to/file")
```

Use through familiar GeoPandas API (>= 0.11):

```python
gdf = geopandas.read_file("path/to/file", engine="pyogrio")
gdf.to_file("path/to/file", engine="pyogrio")
```

Goal to make `engine="pyogrio"` the default in the future.

---

# pyogrio benchmark

```python
geopandas.read_file("nz-buildings-outlines.gpkg", engine="pyogrio")  # or engine="fiona"
```

![:scale 49%](img/bench_pyogrio_read.png) ![:scale 49%](img/bench_pyogrio_memory.png)

We have seen >5-10x speedups reading files and >5-20x speedups writing files compared to using non-vectorized approaches (current I/O support in GeoPandas using Fiona).

???

GDAL RFC columnar layout?

---
class: middle, center

.medium[
```
pip install pyogrio
# or
conda install -c conda-forge pyogrio
```
]
### ➔ Feedback to https://github.com/geopandas/pyogrio/issues/

---
class: center, middle

# GeoParquet

---
class: theme-green-minimal
layout: true
name: parquet

## What is Apache Parquet?

From http://parquet.apache.org/:

.abs-layout.bottom-1.right-50.width-20[
![](img/Apache_Parquet_logo.svg.png)
]

---
layout: false
template: parquet
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

---
template: parquet
count: false

> *Apache Parquet is an **open source**, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is **available in multiple languages** including Java, C++, Python, etc...*

---
count: false

> *Apache Parquet is an open source, **column-oriented** data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle **complex data** in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides **efficient data compression and encoding schemes** with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...*

---
count: false

> *Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Parquet is available in multiple languages including Java, C++, Python, etc...*

➔ Widely used file format to store large amounts of data (data lakes) for analytical processing, often in cloud context

---
layout: false
class: theme-green-minimal

## What is GeoParquet?

Goal:

> *Standardize how geospatial data is represented in Parquet to further geospatial interoperability among tools using Parquet today, and hopefully help push forward what's possible with 'cloud-native geospatial' workflows.*

--
count: false

➔ Specification of how to store geospatial vector data in Parquet files

--
count: false

* Which data type to use (currently WKB as variable-size binary (BYTE_ARRAY))

--
count: false

* Metadata (encoding, coordinate reference system, geometry types, planar vs spherical edges, ...)

???

(using the existing Parquet spec)

Features:

* Multiple spatial reference systems
* Multiple geometry columns
* Work with both planar and spherical coordinates
* Great compression / small files
* Great at read-heavy analytic workflows
* Support for data partitioning
* Enable spatial indices (planned)

---
class: theme-green-minimal

## GeoParquet: fast reading and writing

Python:

```python
import geopandas
geopandas.read_parquet("nz-buildings-outlines.parquet")
```

R:

```r
library(geoarrow)
read_geoparquet_sf("nz-buildings-outlines.parquet")
```

---
class: theme-green-minimal

## GeoParquet: fast reading and writing

```python
import geopandas
geopandas.read_parquet("nz-buildings-outlines.parquet")
```

![:scale 40%](img/bench_geoparquet.png) ![:scale 27%](img/bench_geoparquet_file_size.png)

.midi[Benchmark using GDAL master with pyogrio (GPKG, SHP, FGB) and pyarrow (Parquet) ]

---
class: theme-green-minimal

## GeoParquet: evolving specification

* GeoPandas has had Parquet IO since June 2020
* Official initiative started in OGC repo (2021)
* GeoParquet v0.1.0 (March 9, 2022): initial release
* GeoParquet v0.4.0 (May 26, 2022): use of PROJJSON, ...

Early contributors include developers from GeoPandas, GeoTrellis, OpenLayers, Vis.gl, Planet, Voltron Data, Microsoft, Carto, Azavea & Unfolded

This spec is work in progress! Feedback very welcome!

https://github.com/opengeospatial/geoparquet

---
class: theme-green-minimal

## GeoParquet: evolving specification

Expand support for the format:

* Included in GDAL 3.5
* Python (GeoPandas), R (sfarrow & geoarrow) and Julia (GeoParquet.jl)
* WIP for Apache Sedona, Carto, ...
* Usage in data stores (Microsoft's Planetary Computer Data Catalog, ...)

Future work:

* Support for data partitioning and spatial indexing
* Stabilize the spec

Welcome to join at https://github.com/opengeospatial/geoparquet !

???

Features:

* Multiple spatial reference systems
* Multiple geometry columns
* Work with both planar and spherical coordinates
* Great compression / small files
* Great at read-heavy analytic workflows
* Support for data partitioning
* Enable spatial indices (planned)

---
class: middle, center

# GeoPandas

# Easy, fast and .darkred[.fat[scalable]] geospatial analysis in Python

## ➔ Introducing dask-geopandas

???

Python has a fast and pragmatic data science ecosystem

... restricted to in-memory and a single core

---

.center[
![:scale 55%](img/dask_horizontal.svg)
]

## A flexible library for parallelism

* A parallel computing framework, written in pure Python
* Lets you work on larger-than-memory datasets
* That leverages the excellent Python ecosystem
* Using blocked algorithms and task scheduling

https://www.dask.org/

???

# An experiment with taxi data

[Ravi Shekhar](http://people.earth.yale.edu/profile/ravi-shekhar/about) published a blogpost [Geospatial Operations at Scale with Dask and GeoPandas](https://medium.com/towards-data-science/geospatial-operations-at-scale-with-dask-and-geopandas-4d92d00eb7e8) in which he counted the number of rides originating from each of the official taxi zones of New York City

Matthew Rocklin re-ran the experiment with the in-development version: 3h -> 8min ([see his blogpost](http://matthewrocklin.com/blog/work/2017/09/21/accelerating-geopandas-1))

[dask-geopandas](https://github.com/mrocklin/dask-geopandas): experimental library with parallelized geospatial operations and joins

## Demo time!

See [static version](http://nbviewer.jupyter.org/gist/jorisvandenbossche/67be41a246c1281d7046b31690988321)

---

# GeoPandas - dask bridge

https://github.com/geopandas/dask-geopandas

### New library with parallelized geospatial operations and joins

--
count: false

.center[
## Demo time!

See [static version](https://dask-geopandas.readthedocs.io/en/stable/guide/basic-intro.html)
]

---

# GeoPandas - dask bridge

https://github.com/geopandas/dask-geopandas

### New library with parallelized geospatial operations and joins

Many areas for improvement:

- Higher coverage of functionality (overlay, ...)
- Better use of spatial indexing
- ...

Young project, but ready to be tried out!

---
class: middle

# Thanks for listening!

## Thanks to all contributors!

## These slides:

- https://github.com/jorisvandenbossche/talks/
- [jorisvandenbossche.github.io/talks/2022_GeoPython_geopandas](http://jorisvandenbossche.github.io/talks/2022_GeoPython_geopandas)

http://geopandas.org

---

---
class: middle, center

# Open source geospatial software

.center[
![:scale 70%](img/Open_Source_Geospatial_Foundation.svg)
]

???

# geospatial software

This presentation: in python

but everything I will present -> builds upon widely used open source libraries

Open Source Geospatial Foundation

OSGeo was created to support the collaborative development of open source geospatial software, and promote its widespread use.

---

# GDAL / OGR

### Geospatial Data Abstraction Library.

* The Swiss army knife for geospatial.
* Read and write Raster (GDAL) and Vector (OGR) datasets
* More than 200 (mainly) geospatial formats and protocols.

.center[
![:scale 100%](img/gdal_formats)
]

.credits[
Slide from "GDAL 2.2 What's new?" by Even Rouault (CC BY-SA)
]

???

GDAL is a translator library for raster and vector geospatial data formats. As a library, it presents a single raster abstract data model and single vector abstract data model to the calling application for all supported formats. It also comes with a variety of useful command line utilities for data translation and processing.

---

# GEOS

## Geometry Engine Open Source

* C/C++ port of a subset of Java Topology Suite (JTS)
* Most widely used geospatial C++ geometry library
* Implements geometry objects (simple features), spatial predicate functions and spatial operations

Used under the hood by many applications (QGIS, PostGIS, MapServer, GRASS, GeoDjango, ...)

[geos.osgeo.org](http://geos.osgeo.org)