+ - 0:00:00
Notes for current slide
Notes for next slide

GEOS in the Python ecosystem

Joris Van den Bossche, FOSS4G Belgium, October 24, 2019

https://github.com/jorisvandenbossche/talks/

@jorisvdbossche

1 / 28

About me

Joris Van den Bossche

  • Background: PhD bio-science engineer, air quality research
  • Open source enthusiast: pandas core dev, geopandas maintainer, scikit-learn contributor
  • Currently freelance open source developer and teacher, working part-time on Apache Arrow (at Ursa Labs)

https://github.com/jorisvandenbossche Twitter: @jorisvdbossche

2 / 28

Vector processing in QGIS

3 / 28

Vector processing in QGIS

⮕ using the GEOS library under the hood.
3 / 28

Vector processing in Postgis

Example from https://postgis.net/workshops/postgis-intro/:

SELECT
subways.name AS subway_name,
neighborhoods.name AS neighborhood_name,
neighborhoods.boroname AS borough
FROM nyc_neighborhoods AS neighborhoods
JOIN nyc_subway_stations AS subways
ON ST_Contains(neighborhoods.geom, subways.geom)
WHERE subways.name = 'Broad St';
4 / 28

Vector processing in Postgis

Example from https://postgis.net/workshops/postgis-intro/:

SELECT
subways.name AS subway_name,
neighborhoods.name AS neighborhood_name,
neighborhoods.boroname AS borough
FROM nyc_neighborhoods AS neighborhoods
JOIN nyc_subway_stations AS subways
ON ST_Contains(neighborhoods.geom, subways.geom)
WHERE subways.name = 'Broad St';
⮕ using the GEOS library under the hood.
4 / 28

Vector processing in R (sf)

Snippets from presentation last year (https://pokyah.shinyapps.io/foss4GBXL2018):

library(sf)
belgium = sf::st_as_sf(
rnaturalearth::ne_states(country = 'belgium'))
wallonia = belgium %>% dplyr::filter(region == "Walloon")
grid = sf::st_intersection(
grid, sf::st_transform(wallonia, crs = 3812))
5 / 28

Vector processing in R (sf)

Snippets from presentation last year (https://pokyah.shinyapps.io/foss4GBXL2018):

library(sf)
belgium = sf::st_as_sf(
rnaturalearth::ne_states(country = 'belgium'))
wallonia = belgium %>% dplyr::filter(region == "Walloon")
grid = sf::st_intersection(
grid, sf::st_transform(wallonia, crs = 3812))
⮕ using the GEOS library under the hood.
5 / 28

Vector processing in Python

Using Shapely and GeoPandas:

import geopandas
import shapely.geometry
districts = geopandas.read_file("paris_districts.gpkg")
notre_dame = shapely.geometry.Point(452321, 5411311)
# filter districts that contain the point
districts[districts.contains(notre_dame)]
6 / 28

Vector processing in Python

Using Shapely and GeoPandas:

import geopandas
import shapely.geometry
districts = geopandas.read_file("paris_districts.gpkg")
notre_dame = shapely.geometry.Point(452321, 5411311)
# filter districts that contain the point
districts[districts.contains(notre_dame)]
⮕ using the GEOS library under the hood.
6 / 28

GEOS

Geometry Engine Open Source

  • C/C++ port of a subset of Java Topology Suite (JTS)
  • Most widely used geospatial C++ geometry library
  • Implements geometry objects (simple features), spatial predicate functions and spatial operations, prepared geometries, STR spatial index, WKT/WKB encoding and decoding

Used under the hood by many applications (GDAL, QGIS, PostGIS, MapServer, GRASS, GeoDjango, ...)

geos.osgeo.org

7 / 28

JTS itself also used in GeoServer, GeoTools, ...

Simple features

Simple feature access - OGC / ISO standard:

8 / 28

Spatial predicates

https://en.wikipedia.org/wiki/DE-9IM

9 / 28

Spatial operations

10 / 28

Spatial operations

11 / 28

GEOS

Geometry Engine Open Source

  • C/C++ port of a subset of Java Topology Suite (JTS)
  • Most widely used geospatial C++ geometry library
  • Implements geometry objects (simple features), spatial predicate functions and spatial operations, prepared geometries, STR spatial index, WKT/WKB encoding and decoding

Used under the hood by many applications (GDAL, QGIS, PostGIS, MapServer, GRASS, GeoDjango, ...)

geos.osgeo.org

12 / 28

GEOS in the Python ecosystem

13 / 28

Shapely

Python package for the manipulation and analysis of geometric objects

Pythonic interface to GEOS

14 / 28

Shapely

Python package for the manipulation and analysis of geometric objects

Pythonic interface to GEOS

>>> from shapely.geometry import Point, LineString, Polygon
>>> point = Point(1, 1)
>>> line = LineString([(0, 0), (1, 2), (2, 2)])
>>> poly = line.buffer(1)

              

>>> poly.contains(point)
True
14 / 28

Shapely

Python package for the manipulation and analysis of geometric objects

Pythonic interface to GEOS

>>> from shapely.geometry import Point, LineString, Polygon
>>> point = Point(1, 1)
>>> line = LineString([(0, 0), (1, 2), (2, 2)])
>>> poly = line.buffer(1)

              

>>> poly.contains(point)
True

Nice interface to GEOS, but: single objects, no attributes

14 / 28

Shapely

typical predicates and operations

GeoPandas

Make working with tabular geospatial data in python easier by combining Shapely and pandas

  • Extends the pandas data analysis library to work with geographic objects and spatial operations
  • Combines the power of whole ecosystem of (geo) tools (pandas, geos, shapely, gdal, fiona, pyproj, rtree, ...)
  • Bridge between geospatial packages and the scientific / data science stack

Documentation: http://geopandas.readthedocs.io/

15 / 28

What Postgis is for databases/postgresql, GeoPandas is for Python/pandas

make working with geospatial data like working with any other kind of data in python (data stack, numpy, pandas and other tools around those) analysis for which you otherwise would need desktop GIS applications (QGIS, ArcGIS) or geospatial databases (PostGIS)

makes pandas objects geometry aware

GeoPandas

Make working with tabular geospatial data in python easier by combining Shapely and pandas

>>> df = geopandas.read_file("ne_110m_admin_0_countries.shp")
>>> df
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Oceania Fiji FJI 8374.0 MULTIPOLYGON (((180.00000 ...
1 53950935 Africa Tanzania TZA 150600.0 POLYGON ((33.90371 -0.9500...
2 603253 Africa W. Sahara ESH 906.5 POLYGON ((-8.66559 27.6564...
3 35623680 North America Canada CAN 1674000.0 MULTIPOLYGON (((-122.84000...
.. ... ... ... ... ... ...
>>> df = df.to_crs(epsg=3857)
>>> df.geometry.area / 1e9
0 21.283337
1 952.255175
2 117.102338
3 52166.480440
.. ...
16 / 28

Why is GeoPandas slow?

  • GeoPandas stores custom Python objects in arrays
  • For operations, it iterates through those objects
  • Those Python objects each call the GEOS C operation
17 / 28

Why is GeoPandas slow?

  • GeoPandas stores custom Python objects in arrays
  • For operations, it iterates through those objects
  • Those Python objects each call the GEOS C operation
class GeoSeries:
...
def distance(self, other):
result = [geom.distance(other) for geom in self.geometry]
return pd.Series(result)
18 / 28

Introducing PyGEOS

19 / 28

Introducing PyGEOS

New library that exposes geospatial operations from GEOS into Python:

  • array-based
  • fast
20 / 28

Introducing PyGEOS

New library that exposes geospatial operations from GEOS into Python:

  • array-based
  • fast

Started by Casper van der Wel:
https://caspervdw.github.io/Introducing-Pygeos/

GitHub repo:
https://github.com/pygeos/pygeos/

20 / 28

Array-based

Instead of (using Shapely)

[poly.contains(point) for point in points]

you can do

pygeos.contains(poly, points)
21 / 28

Fast

Benchmark for 1M points: contained in or distance to a polygon

Significant performance increase: 80x (contains) to 5x (distance) for this example

22 / 28

Numpy "universal functions"

Numpy universal functions (ufuncs) are vectorized functions that work on arrays element-by-element supporting numpy features such as broadcasting

Demo!

23 / 28

Running in parallel (WIP)

Possibility to run in parallel (releasing the GIL)

Combination with Dask (https://dask.org/):

# with pygeos, single core
res1 = pygeos.distance(points, poly)
# chunked using dask, multi-threaded
points_chunked = dask.array.from_array(points, chunks=100_000)
res2 = points_chunked.map_blocks(pygeos.distance, poly, dtype=float)
24 / 28

Running in parallel (WIP)

Possibility to run in parallel (releasing the GIL)

Combination with Dask (https://dask.org/):

# with pygeos, single core
res1 = pygeos.distance(points, poly)
# chunked using dask, multi-threaded
points_chunked = dask.array.from_array(points, chunks=100_000)
res2 = points_chunked.map_blocks(pygeos.distance, poly, dtype=float)

-> 3x speed-up on my 4 core laptop

24 / 28

PyGEOS implementation ?

  • pygeos.Geometry Python C extension type holding pointer to GEOS Geometry object
  • Extension type ensures garbage collection on the Python level, but the pointer is accessible from C without overhead
  • The ufuncs are implemented in C using the numpy C API
25 / 28

Further work

  • Speed-up GeoPandas by leveraging PyGEOS
26 / 28

Further work

  • Speed-up GeoPandas by leveraging PyGEOS
  • Integration with Shapely?
26 / 28

Further work

  • Speed-up GeoPandas by leveraging PyGEOS
  • Integration with Shapely?
  • Spatial index (STRTree), spatial join
  • Prepared geometries
  • More coverage of GEOS functions
  • ...

https://github.com/pygeos/pygeos/issues

26 / 28

Want to try out? Contribute?

Docs: https://pygeos.readthedocs.io

Install using conda:

$ conda install --channel conda-forge pygeos

Contribute: https://github.com/pygeos/pygeos/

27 / 28

Want to try out? Contribute?

Docs: https://pygeos.readthedocs.io

Install using conda:

$ conda install --channel conda-forge pygeos

Contribute: https://github.com/pygeos/pygeos/

Feedback and contributions very welcome!

27 / 28

Thanks for listening! Questions?

Thanks to Casper Van der Wel for the collaboration

Those slides:

http://pygeos.readthedocs.io

28 / 28

About me

Joris Van den Bossche

  • Background: PhD bio-science engineer, air quality research
  • Open source enthusiast: pandas core dev, geopandas maintainer, scikit-learn contributor
  • Currently freelance open source developer and teacher, working part-time on Apache Arrow (at Ursa Labs)

https://github.com/jorisvandenbossche Twitter: @jorisvdbossche

2 / 28
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow