GeoPandas now uses the pandas ExtensionArray interface

Short summary: the upcoming 0.6.0 release of GeoPandas will feature a refactor based on the pandas ExtensionArray interface. Although this change should keep the user interface mostly stable, it enables more robust integration with pandas and paves the way for further improvements. And given the invasive code changes under the hood, testing is very welcome!


GeoPandas extends the pandas data analysis library to enable spatial operations on geometric types. More specifically, it provides the GeoSeries and GeoDataFrame classes (subclasses of the pandas Series and DataFrame) to work with geospatial vector datasets.

In recent releases, pandas has focused on extensibility. It introduced the "pandas ExtensionArray interface", which allows third-party libraries to define custom data types that go beyond the numpy data types and to specify how they should be handled within pandas. This is a perfect fit for GeoPandas, and being able to use it in GeoPandas was one of the drivers for me to contribute to those developments in pandas. See this blog post for more about ExtensionArrays.

The upcoming 0.6.0 release of GeoPandas features a refactor of its internals, now finally using this new ExtensionArray interface of pandas. These are mainly code changes under the hood, without really changing how GeoPandas gets used. But the refactor will enable more improvements in the future.

In [2]:
import geopandas
geopandas.__version__
Out[2]:
'0.6.0rc1'

A geometry dtype

Let's look at a few of the changes. Reading the built-in New York boroughs file:

In [3]:
gdf = geopandas.read_file(geopandas.datasets.get_path('nybb'))
In [4]:
gdf
Out[4]:
BoroCode BoroName Shape_Leng Shape_Area geometry
0 5 Staten Island 330470.010332 1.623820e+09 MULTIPOLYGON (((970217.0223999023 145643.33221...
1 4 Queens 896344.047763 3.045213e+09 MULTIPOLYGON (((1029606.076599121 156073.81420...
2 3 Brooklyn 741080.523166 1.937479e+09 MULTIPOLYGON (((1021176.479003906 151374.79699...
3 1 Manhattan 359299.096471 6.364715e+08 MULTIPOLYGON (((981219.0557861328 188655.31579...
4 2 Bronx 464392.991824 1.186925e+09 MULTIPOLYGON (((1012821.805786133 229228.26458...

Up to now, everything looks familiar. The main user-facing difference is that the 'geometry' column in the above GeoDataFrame is no longer of object dtype (the "catch-all" dtype in pandas that can hold any Python object):

In [5]:
gdf.dtypes
Out[5]:
BoroCode         int64
BoroName        object
Shape_Leng     float64
Shape_Area     float64
geometry      geometry
dtype: object

We now see that our geometry column has a geometry data type!

Apart from being more user friendly than the generic "object", this also ensures that the values in that column are all actual geometry objects (or the missing value indicator).
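
As a small illustration of what the dedicated dtype makes possible, the sketch below finds geometry columns by inspecting dtypes instead of guessing from column names. It assumes the GeometryDtype class is importable from geopandas.array, and it uses a tiny made-up frame (the "code" column and the two points are purely illustrative) rather than the New York boroughs file:

```python
import geopandas
from geopandas.array import GeometryDtype
from shapely.geometry import Point

# Small hypothetical frame, used here instead of the nybb dataset.
df = geopandas.GeoDataFrame(
    {"code": [1, 2], "geometry": [Point(0, 0), Point(1, 1)]}
)

# The geometry column carries the extension dtype, not plain "object".
geom_dtype = df["geometry"].dtype
dtype_name = str(geom_dtype)
is_geometry = isinstance(geom_dtype, GeometryDtype)

# Collect every column whose dtype is the geometry extension dtype.
geometry_columns = [
    col for col, dtype in df.dtypes.items() if isinstance(dtype, GeometryDtype)
]

print(dtype_name, is_geometry, geometry_columns)
```

With an object-dtype column, a check like this would have required looking at the individual values.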

The underlying array values of a GeoSeries are now stored in the custom GeometryArray (the array-like implemented in GeoPandas that follows the pandas ExtensionArray interface):

In [6]:
gdf.geometry.values
Out[6]:
<GeometryArray>
[<shapely.geometry.multipolygon.MultiPolygon object at 0x7f340afbf8d0>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21ac8>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21b00>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af218d0>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21588>]
Length: 5, dtype: geometry
In [7]:
type(gdf.geometry.values)
Out[7]:
geopandas.array.GeometryArray

Before, this would have been a numpy array. You can still get one by explicitly converting to a numpy array:

In [8]:
import numpy as np
np.asarray(gdf.geometry)
Out[8]:
array([<shapely.geometry.multipolygon.MultiPolygon object at 0x7f340afbf8d0>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21ac8>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21b00>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af218d0>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21588>],
      dtype=object)

Note that this is not a "native" geometry data type. For now at least, it still stores the Shapely objects in an object-dtype numpy array under the hood, now wrapped in a GeometryArray to integrate better with pandas. The current release does not contain many fancy new features. It mainly tries to preserve the existing GeoPandas experience, based on this new interface with pandas under the hood. But it will allow more exciting changes in the near future! See below on future performance improvements.
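
To make the wrapping concrete, here is a minimal sketch of the round trip between Shapely objects, a GeometryArray and a plain numpy array. It assumes the from_shapely helper in geopandas.array as the way to build a GeometryArray directly; the two points are made up for illustration:

```python
import numpy as np
from geopandas.array import GeometryArray, from_shapely
from shapely.geometry import Point
from shapely.geometry.base import BaseGeometry

# Wrap a list of Shapely objects in the extension array.
points = [Point(0, 0), Point(1, 1)]
arr = from_shapely(points)

# Converting back yields an object-dtype numpy array of Shapely objects.
as_numpy = np.asarray(arr)

print(type(arr).__name__, arr.dtype, as_numpy.dtype)
```

The elements you get back are ordinary Shapely geometries, so existing code that loops over the values should keep working after an explicit conversion.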

Missing geometries versus empty geometries

Although not strictly related to the ExtensionArray refactor, we also took the opportunity to make missing data handling more consistent within GeoPandas. Historically, missing ("NA") values in a GeoSeries could be represented by empty geometric objects, in addition to standard representations such as None and np.nan. At least, this was the case in GeoSeries.isna() or when GeoSeries were aligned in geospatial operations. But other methods like dropna and fillna did not follow this approach and did not consider empty geometries as missing.

In the upcoming 0.6.0 release, we have changed this behaviour to be more in line with pandas and to be consistent within GeoPandas: only actual missing values are considered missing:

In [9]:
from shapely.geometry import Polygon
In [10]:
s = geopandas.GeoSeries([Polygon([(0, 0), (1, 1), (0, 1)]), None, Polygon([])])
In [11]:
s
Out[11]:
0    POLYGON ((0 0, 1 1, 0 1, 0 0))
1                              None
2          GEOMETRYCOLLECTION EMPTY
dtype: geometry

The GeoSeries.isna() method now only returns True for the missing value (the second element):

In [12]:
s.isna()
Out[12]:
0    False
1     True
2    False
dtype: bool

If you want to know which values are empty geometries, you can use the existing GeoSeries.is_empty:

In [13]:
s.is_empty
Out[13]:
0    False
1    False
2     True
dtype: bool

Or combine both to get the previous behaviour of GeoSeries.isna(), detecting both missing and empty geometries:

In [14]:
s.is_empty | s.isna()
Out[14]:
0    False
1     True
2     True
dtype: bool
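
If your code relied on the old behaviour, one option is to wrap that combination in a small function. The sketch below is only an illustration: is_missing_or_empty is a hypothetical helper name, not part of GeoPandas. It also shows that dropna() on its own now removes only the truly missing value and keeps the empty geometry:

```python
import geopandas
from shapely.geometry import Polygon


def is_missing_or_empty(series):
    """Hypothetical helper: True where a geometry is missing or empty."""
    return series.isna() | series.is_empty


s = geopandas.GeoSeries([Polygon([(0, 0), (1, 1), (0, 1)]), None, Polygon()])

# dropna() only removes the actual missing value (None).
only_present = s.dropna()

# Filtering with the helper also removes the empty geometry.
usable = s[~is_missing_or_empty(s)]

print(len(s), len(only_present), len(usable))
```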

Please test!

On the surface, this doesn't change much. Your code using GeoPandas should keep working as before. But under the hood, quite a few changes were made, and such a refactor can always come with unintended side effects. So this is a call to try out this new version in your applications!

You can update your GeoPandas install to the release candidate with:

conda install --channel conda-forge/label/rc geopandas=0.6.0rc1
# or with pip:
pip install --pre geopandas==0.6.0rc1

If you encounter any issues, please report them at https://github.com/geopandas/geopandas/issues

Upcoming performance improvements

With this initial 0.6.0 release using the pandas ExtensionArray interface, we haven't changed much of GeoPandas' functionality yet. But it will allow us to focus on important performance improvements for the next release. I already blogged about that quite some time ago, and now that the 0.6.0 refactor is done, we can finally work towards landing this in GeoPandas itself!
