class: center, middle # Shapely 2.0 ## An improved foundation for GeoPandas and the Python geospatial ecosystem Joris Van den Bossche, GeoPython, September 22, 2020 https://github.com/jorisvandenbossche/talks/ [@jorisvdbossche](https://twitter.com/jorisvdbossche) --- # About me Joris Van den Bossche
- Background: PhD bio-science engineer, air quality research - Open source enthusiast: pandas core dev, geopandas maintainer, scikit-learn contributor - Currently freelance open source developer and teacher, working part-time on Apache Arrow (at Ursa Labs) https://github.com/jorisvandenbossche Twitter: [@jorisvdbossche](https://twitter.com/jorisvdbossche)
.center[ .affiliations[ ![:scale 29%](img/pandas.svg) ![:scale 39%](img/geopandas_logo.svg) ![:scale 22%](img/ursa_logo_inverted.png) ] ] --- class: center, middle # GEOS --- # Vector processing in QGIS .center[ ![:scale 75%](img/example_QGIS.png) ] -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # Vector processing in Postgis Example from https://postgis.net/workshops/postgis-intro/: ```sql SELECT subways.name AS subway_name, neighborhoods.name AS neighborhood_name, neighborhoods.boroname AS borough FROM nyc_neighborhoods AS neighborhoods JOIN nyc_subway_stations AS subways ON `ST_Contains`(neighborhoods.geom, subways.geom) WHERE subways.name = 'Broad St'; ``` -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # Vector processing in R (sf) Example from FOSS4G Belgium presentation by Thomas Goossens (https://pokyah.shinyapps.io/foss4GBXL2018): ```r library(sf) belgium = sf::st_as_sf( rnaturalearth::ne_states(country = 'belgium')) wallonia = belgium %>% dplyr::filter(region == "Walloon") grid = `sf::st_intersection`( grid, sf::st_transform(wallonia, crs = 3812)) ``` -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # Vector processing in Python Using Shapely and GeoPandas: ```python import geopandas import shapely.geometry districts = geopandas.read_file("paris_districts.gpkg") notre_dame = shapely.geometry.Point(452321, 5411311) # filter districts that contain the point districts[districts.`contains`(notre_dame)] ``` -- count: false .bottom[ ⮕ using the
GEOS
library under the hood. ] --- # GEOS
## Geometry Engine Open Source * C/C++ port of a subset of Java Topology Suite (JTS) * Most widely used geospatial C++ geometry library * Implements geometry objects (simple features), spatial predicate functions and spatial operations, prepared geometries, STR spatial index, WKT/WKB encoding and decoding Used under the hood by many applications (GDAL, QGIS, PostGIS, MapServer, GRASS, GeoDjango, ...) [geos.osgeo.org](http://geos.osgeo.org) ??? JTS itself also used in GeoServer, GeoTools, ... --- # Simple features Simple feature access - OGC / ISO standard: ![:scale 100%](img/simple_features_3_text.svg) --- # Spatial predicates https://en.wikipedia.org/wiki/DE-9IM .center[ ![:scale 80%](img/TopologicSpatialRelarions2.png) ] --- # Spatial operations
--- # Spatial operations
--- # GEOS
## Geometry Engine Open Source * C/C++ port of a subset of Java Topology Suite (JTS) * Most widely used geospatial C++ geometry library * Implements geometry objects (simple features), spatial predicate functions and spatial operations, prepared geometries, STR spatial index, WKT/WKB encoding and decoding Used under the hood by many applications (GDAL, QGIS, PostGIS, MapServer, GRASS, GeoDjango, ...) [geos.osgeo.org](http://geos.osgeo.org) --- class: middle, center # GEOS in the Python ecosystem --- # Shapely Python package for the manipulation and analysis of geometric objects
Pythonic interface to GEOS -- count:false .mmedium[ ```python >>> from shapely.geometry import Point, LineString, Polygon >>> point = Point(1, 1) >>> line = LineString([(0, 0), (1, 2), (2, 2)]) >>> poly = line.buffer(1) ``` ]
.mmedium[ ```python >>> poly.contains(point) True ``` ] -- count: false Nice interface to GEOS, but: single objects, no attributes ??? # Shapely typical predicates and operations --- .center[ ![:scale 60%](img/geopandas_logo.svg) ] Make working with tabular geospatial data in python easier by combining Shapely and pandas * Extends the pandas data analysis library to work with geographic objects and spatial operations * Combines the power of whole ecosystem of (geo) tools (pandas, geos, shapely, gdal, fiona, pyproj, rtree, ...) * Bridge between geospatial packages and the scientific / data science stack Documentation: http://geopandas.readthedocs.io/ ??? What Postgis is for databases/postgresql, GeoPandas is for Python/pandas make working with geospatial data like working with any other kind of data in python (data stack, numpy, pandas and other tools around those) analysis for which you otherwise would need desktop GIS applications (QGIS, ArcGIS) or geospatial databases (PostGIS) makes pandas objects geometry aware --- # GeoPandas Make working with tabular geospatial data in python easier by combining Shapely and pandas .small[ ```python >>> df = geopandas.read_file("ne_110m_admin_0_countries.shp") >>> df pop_est continent name iso_a3 gdp_md_est geometry 0 920938 Oceania Fiji FJI 8374.0 MULTIPOLYGON (((180.00000 ... 1 53950935 Africa Tanzania TZA 150600.0 POLYGON ((33.90371 -0.9500... 2 603253 Africa W. Sahara ESH 906.5 POLYGON ((-8.66559 27.6564... 3 35623680 North America Canada CAN 1674000.0 MULTIPOLYGON (((-122.84000... .. ... ... ... ... ... ... >>> df = df.to_crs(epsg=3857) >>> df.geometry.area / 1e9 0 21.283337 1 952.255175 2 117.102338 3 52166.480440 .. ... ``` ] ??? GeoPandas gives a very nice interface (although I am biased of course), but performance wise there is certainly room for improvement. --- # Why is GeoPandas slow? - GeoPandas stores custom Python (Shapely) objects in arrays - For operations, it iterates through those objects - The Shapely objects each call the GEOS C operation
.center[ ![:scale 60%](img/geopandas-shapely-1.svg) ] --- # Why is GeoPandas slow? - GeoPandas stores custom Python (Shapely) objects in arrays - For operations, it iterates through those objects - The Shapely objects each call the GEOS C operation
```python class GeoSeries: ... def distance(self, other): result = [geom.distance(other) for geom in self.geometry] return pd.Series(result) ``` --- # Making it faster - Move the loop into C and iterate directly over pointers to GEOS objects
.center[ ![:scale 60%](img/geopandas-shapely-2.svg) ] --- class: middle, center # Introducing PyGEOS --- # Introducing PyGEOS New library that exposes geospatial operations from GEOS into Python: - array-based + vectorized functions - fast -- count: false Started by Casper van der Wel:
https://caspervdw.github.io/Introducing-Pygeos/ GitHub repo:
https://github.com/pygeos/pygeos/ --- # Array-based Instead of (using Shapely) ```python [poly.contains(point) for point in points] ``` you can do ```python pygeos.contains(poly, points) ``` --- # Fast Benchmark for 1M points: contained in or distance to a polygon ![:scale 49%](img/pygeos_timings-contains.png) ![:scale 49%](img/pygeos_timings-distance.png) Significant performance increase: 80x (contains) to 5x (distance) for this example --- # Fast Spatial join of NYC neighborhoods with census blocks (example from https://postgis.net/workshops/postgis-intro/) ```python geopandas.sjoin( nyc_neighborhoods, nyc_census_blocks, op='intersects' ) ``` .center[ ![:scale 49%](img/geopandas-timings_sjoin_pygeos.png) ] ??? But depends a lot on the exact case (characteristics of the geometries being joined). Another artificial case of point-in-polygon with a grid, get 40x speed-up. --- # Running in parallel (WIP) Possibility to run in parallel (releasing the GIL) Combination with Dask (https://dask.org/): .mmedium[ ```python # with pygeos, single core res1 = pygeos.distance(points, poly) ``` ] .mmedium[ ```python # chunked using dask, multi-threaded points_chunked = dask.array.from_array(points, chunks=100_000) res2 = points_chunked.map_blocks(pygeos.distance, poly, dtype=float) ``` ] -- count: false -> 3x speed-up on my 4 core laptop See alos: https://caspervdw.github.io/Pygeos-Multithreading/ --- # PyGEOS implementation ? * `pygeos.Geometry` Python C extension type holding pointer to GEOS Geometry object * Extension type ensures garbage collection on the Python level, but the pointer is accessible from C without overhead * The ufuncs are implemented in C using the numpy C API --- class: middle, center # Shapely 2.0 --- # RFC: Roadmap for Shapely 2.0 See full proposal at https://github.com/shapely/shapely-rfc/pull/1/ * Fully move the code of PyGEOS into Shapely * Refactor the internals of Shapely to use C extension type instead of `ctypes` for the Geometry class * Provide array functionality * Preserve familiar Shapely interface for single geometries * Set of API changes * Numpy now a required dependency --- # RFC: Roadmap for Shapely 2.0 See full proposal at https://github.com/shapely/shapely-rfc/pull/1/ * Set of API changes: * Make the Geometry objects immutable + hashable * Remove "array interface" (`np.array(line)` -> `np.array(line.coords)`) * Multi-part geometries are no longer iterable (`list(multi_polygon)` -> `list(multi_polygon.geoms)`) --- # Next steps * Complete the refactor of Shapely ([GH-962](https://github.com/Toblerity/Shapely/issues/962)): * Shapely 1.8: include deprecation warnings * Shapely 2.0: refactored version * GeoPandas can already optionally use PyGEOS * Parallelize / distribute GeoPandas using dask (https://github.com/jsignell/dask-geopandas/) ??? gif of dashboard of dask with running a sjoin in parallel? --- # How can you help? * Try out GeoPandas with PyGEOS and give feedback * https://geopandas.readthedocs.io/en/latest/getting_started/install.html#using-the-optional-pygeos-dependency * Try out / contribute to PyGEOS * https://github.com/pygeos/pygeos/issues * Test and update your packages once Shapely 1.8 is released ??? * Prepared geometries * More coverage of GEOS functions --- class: middle # Thanks for listening! Questions? ### Thanks to Casper Van der Wel and the Shapely developers for the collaboration ### Those slides: - https://github.com/jorisvandenbossche/talks/ - [jorisvandenbossche.github.io/talks/2020_GeoPython_Shapely-2.0](http://jorisvandenbossche.github.io/talks/2020_GeoPython_Shapely-2.0)