class: center, middle # Shapely 2.0 ## An improved foundation for GeoPandas and the Python geospatial ecosystem Joris Van den Bossche, GeoPython, March 7, 2023 https://github.com/jorisvandenbossche/talks/ --- # About me .larger[**Joris Van den Bossche**]
- Background: PhD bio-science engineer, air quality research - Open source enthusiast: core developer of pandas, GeoPandas, Shapely, Apache Arrow, ... - Currently working part-time at Voltron Data on Apache Arrow
.affiliations[    ] .larger[.center[
twitter.com/jorisvdbossche
github.com/jorisvandenbossche
]] --- class: center, middle # Introduction: GEOS --- # Vector processing in QGIS .center[  ] -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # Vector processing in Postgis Example from https://postgis.net/workshops/postgis-intro/: ```sql SELECT subways.name AS subway_name, neighborhoods.name AS neighborhood_name, neighborhoods.boroname AS borough FROM nyc_neighborhoods AS neighborhoods JOIN nyc_subway_stations AS subways ON `ST_Contains`(neighborhoods.geom, subways.geom) WHERE subways.name = 'Broad St'; ``` -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # Vector processing in R (sf) Example from FOSS4G Belgium presentation by Thomas Goossens (https://pokyah.shinyapps.io/foss4GBXL2018): ```r library(sf) belgium = sf::st_as_sf( rnaturalearth::ne_states(country = 'belgium')) wallonia = belgium %>% dplyr::filter(region == "Walloon") grid = `sf::st_intersection`( grid, sf::st_transform(wallonia, crs = 3812)) ``` -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # Vector processing in Python Using Shapely and GeoPandas: ```python import geopandas import shapely.geometry districts = geopandas.read_file("paris_districts.gpkg") notre_dame = shapely.geometry.Point(452321, 5411311) # filter districts that contain the point districts[districts.`contains`(notre_dame)] ``` -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # GEOS
## Geometry Engine Open Source * C/C++ port of a subset of Java Topology Suite (JTS) * Most widely used geospatial C++ computational geometry library * Implements geometry objects (simple features), spatial predicate functions and spatial operations, prepared geometries, STR tree spatial index, WKT/WKB encoding and decoding Used under the hood by many applications (GDAL, QGIS, PostGIS, MapServer, GRASS, GeoDjango, ...) [https://libgeos.org/](https://libgeos.org/) ??? JTS itself also used in GeoServer, GeoTools, ... --- # Simple features Simple feature access - OGC / ISO standard:  --- # Spatial predicates https://en.wikipedia.org/wiki/DE-9IM .center[  ] --- # Spatial operations
--- # Spatial operations
--- # GEOS
## Geometry Engine Open Source * C/C++ port of a subset of Java Topology Suite (JTS) * Most widely used geospatial C++ computational geometry library * Implements geometry objects (simple features), spatial predicate functions and spatial operations, prepared geometries, STR tree spatial index, WKT/WKB encoding and decoding Used under the hood by many applications (GDAL, QGIS, PostGIS, MapServer, GRASS, GeoDjango, ...) [https://libgeos.org/](https://libgeos.org/) --- class: middle, center # GEOS in the Python ecosystem --- # Shapely Python package for the manipulation and analysis of geometric objects
Pythonic interface to GEOS -- count:false .mmedium[ ```python >>> from shapely.geometry import Point, LineString, Polygon >>> point = Point(1, 1) >>> line = LineString([(0, 0), (1, 2), (2, 2)]) >>> poly = line.buffer(1) ``` ]
.mmedium[ ```python >>> poly.contains(point) True ``` ] -- count: false Nice interface to GEOS, but: single objects, no attributes --- # GeoPandas Make working with tabular geospatial data in python easier by combining Shapely and pandas * Extends the pandas data analysis library to work with geographic objects and spatial operations * Combines the power of whole ecosystem of (geo) tools (pandas, geos, shapely, gdal, fiona, pyproj, rtree, ...) * Bridge between geospatial packages and the scientific / data science stack Documentation: http://geopandas.readthedocs.io/ .center[  ] ??? What Postgis is for databases/postgresql, GeoPandas is for Python/pandas make working with geospatial data like working with any other kind of data in python (data stack, numpy, pandas and other tools around those) analysis for which you otherwise would need desktop GIS applications (QGIS, ArcGIS) or geospatial databases (PostGIS) makes pandas objects geometry aware GeoPandas gives a very nice interface (although I am biased of course), but performance wise there is certainly room for improvement. --- # GeoPandas builds upon Shapely - GeoPandas stores custom Python (Shapely) objects in arrays - For operations, it iterates through those objects - The Shapely objects each call the GEOS C operation
.center[  ] --- # GeoPandas builds upon Shapely - GeoPandas stores custom Python (Shapely) objects in arrays - For operations, it iterates through those objects - The Shapely objects each call the GEOS C operation
Before Shapely 2.0: ```python class GeoSeries: ... def distance(self, other): * result = [geom.distance(other) for geom in self.geometry] return pd.Series(result) ``` --- # The (long) road to Shapely 2.0 - My personal motivation: improving GeoPandas' performance - 2017: experimented with cython in GeoPandas - 2019: Casper Van der Wel started PyGEOS - 2020: Proposal to integrate PyGEOS into Shapely - 2021: Internals of Shapely refactored and PyGEOS code merged - 2022: Shapely 2.0.0 released! --- class: middle, center # Shapely 2.0 --- # Shapely 2.0 Code of PyGEOS has been fully integrated into Shapely (including array-based API). Highlights: - Refactor of the internals and the Geometry class (C extension type instead of `ctypes`) - Fast, vectorized (element-wise) array operations - Preserve familiar Shapely interface for single geometries - Many new features (better STRtree, fixed precision, ...) and latest GEOS functions ??? Prototyped in PyGEOS, started by (https://github.com/pygeos/pygeos/).
New way to expose geospatial operations from GEOS into Python: - array-based + vectorized functions - fast Merged into Shapely to become Shapely 2.0: * Code of PyGEOS has been fully integrated into Shapely (including array-based API) * Internals of Shapely are refactored to use C extension type instead of `ctypes` for the Geometry class * Preserve familiar Shapely interface for single geometries * Numpy is now a required dependency --- # Vectorized (element-wise) array functions Instead of a manual `for` loop: .larger[ ```python [polygon.contains(point) for point in points] ``` ] you can do .larger[ ```python shapely.contains(polygon, points) ``` ] -- Those functions are numpy ufuncs! --- # Fast Benchmark for 1M points: contained in or distance to a polygon   Significant performance increase: 80x (contains) to 5x (distance) for this example --- # Fast Spatial join of NYC neighborhoods with census blocks (example from https://postgis.net/workshops/postgis-intro/) ```python geopandas.sjoin( nyc_neighborhoods, nyc_census_blocks, predicate='intersects' ) ``` .center[  ] ??? But depends a lot on the exact case (characteristics of the geometries being joined). Another artificial case of point-in-polygon with a grid, get 40x speed-up. --- # Geometries ❤️ numpy arrays The geometry objects (Point, LineString, Polygon, .. classes) are now **immutable**, **hashable** and **non-iterable** objects, making it a *lot* easier to work with numpy arrays of geometries. ⮕ this caused some behavioural changes! See all details in the 1.8 -> 2.0 migration guide at https://shapely.readthedocs.io/en/latest/migration.html --- # Geometries ❤️ numpy arrays Multi-part geometries are no longer "sequences" (length, indexable, iterable) .pull-left[ Shapely 1.8 ```python mp = MultiPoint(...) # get the number of sub points len(mp) # get the first sub point mp[0] # get a list of sub points len(mp) ``` ] .pull-right[ Shapely 2.0 ```python mp = MultiPoint(...) # get the number of sub points len(mp`.geoms`) # get the first sub point mp`.geoms`[0] # get a list of sub points len(mp`.geoms`) ``` ] This `.geoms` attribute works in Shapely 1.x as well! --- # Geometries ❤️ numpy arrays Single-part geometries (Point, LineString, LinearRing) no longer directly convert to a numpy array of coordinates: .pull-left[ Shapely 1.8 ```python >>> line = LineString( ... [(0, 0), (1, 1), (2, 2)] ... ) >>> np.asarray(line) array([[0., 0.], [1., 1.], [2., 2.]]) ``` ] .pull-right[ Shapely 2.0 ```python >>> line = LineString( ... [(0, 0), (1, 1), (2, 2)] ... ) >>> np.asarray(line`.coords`) array([[0., 0.], [1., 1.], [2., 2.]]) ``` ] This `.coords` attribute works in Shapely 1.x as well! --- # A peak under the hood Shapely 1.x used `ctypes` to interact with GEOS from Python Shapely 2.0 implements a Python "extension type" (a Python class defined in C): ```c typedef struct { PyObject_HEAD void* ptr; // -> pointer to GEOSGeometry ... } GeometryObject; ``` This allows to write the array functions in C as well, accessing the GEOSGeometry pointer without Python overhead, using the numpy ufunc C API. --- # Vectorized (element-wise) array functions Shapely 2.0 allows you to work with arrays of geometries: .larger[ ```python shapely.contains(polygon, points) ``` ] and provides such vectorized (element-wise) versions of all spatial functions that: 1. give a considerable performance improvement 2. release the GIL, allowing multithreading, e.g. in dask-geopandas 3. is simply easier than for loops --- # ... and much more goodies - Geometry subclasses and `shapely.ops` functions are now available in the top-level namespace - More informative repr with truncated WKT - Bindings for more GEOS functions - STRtree: direct predicate evaluation, query an array of geometries at once, `query_nearest` - Support for fixed precision model for geometries and in overlay functions - ... See full release notes at https://shapely.readthedocs.io/en/latest/release/2.x.html#version-2-0-0 ??? Further, many new features from the latest GEOS versions are now exposed: fixed precision overlay functions, distance within, segmentize, remove repeated points, concave hull, ... --- # Bindings for more GEOS functions Several (new) functions from GEOS are now exposed in Shapely: * `hausdorff_distance` and `frechet_distance` * `contains_properly` * `contains_xy` and `intersects_xy` * `dwithin` * `segmentize`, `node`, `extract_unique_points`, `build_area` * `reverse` * `minimum_bounding_circle` and `minimum_bounding_radius` * `coverage_union` and `coverage_union_all` * `remove_repeated_points` (GEOS >= 3.11) * `concave_hull` (GEOS >= 3.11) --- # Concave Hull .center[  ] Convex vs concave hull. Image from http://lin-ear-th-inking.blogspot.com/2022/01/concave-hulls-in-jts.html --- # Spatial indexing with the STRtree Point-in-Polygon example: which of the 500 polygons contain which of the 20,000 points? ```python >>> shapely.contains(polygons, points[:, np.newaxis]) array([[False, False, ..., False, False], [False, False, ..., True, False], ..., [False, False, ..., False, False], [False, False, ..., False, False]]) ``` ⮕ takes ca 250ms ```python >>> tree = shapely.STRtree(points) *>>> tree.query(polygons, predicate="contains") array([[ 1, 1, 1, ..., 484, 484, 484], [8780, 8219, 262, ..., 2224, 5212, 4210]]) ``` ⮕ takes ca 1ms --- # Spatial indexing with the STRtree For each polygon, find all points that are within a certain distance: ```python tree = shapely.STRtree(polygons) indices = tree.query(points, predicate="dwithin", distance=5_000) ``` For each polygon, find the closest point (and optionally the distance to the closes point): ```python tree = shapely.STRtree(polygons) indices, distances = tree.query_nearest(points, return_distance=True) ``` --
Those spatial index methods are used under the hood in GeoPandas `sjoin` and `sjoin_nearest`. --- ## Precision reduction and fixed-precision overlay Precision reduction producing in valid geometries using `shapely.set_precision`: .center[  ] A new `grid_size` keyword in `shapely.intersection` et al: resulting overlay geometry computed on the same grid. Image from http://lin-ear-th-inking.blogspot.com/2020/05/jts-overlay-next-generation.html --- # You were already using PyGEOS? Migrating to Shapely 2.0 is easy! Change your `pygeos` import with `shapely` (only a few keyword names and the STRtree API changed) https://shapely.readthedocs.io/en/stable/migration_pygeos.html --- class: middle # Thanks for listening! Questions? ### Thanks to Casper van der Wel and Brendan Ward for PyGEOS
Thanks to Sean Gillies and Mike Taves and the other Shapely contributors ### Want to help out? There is a sprint tomorrow! ### Those slides: - https://github.com/jorisvandenbossche/talks/ - [jorisvandenbossche.github.io/talks/2023_GeoPython_Shapely-2.0]( http://jorisvandenbossche.github.io/talks/2023_GeoPython_Shapely-2.0)