Short summary: the upcoming 0.6.0 release of GeoPandas will feature a refactor based on the pandas ExtensionArray interface. Although this change should keep the user interface mostly stable, it enables more robust integration with pandas and paves the way for further changes. And given the invasive code changes under the hood, testing is very welcome!
GeoPandas extends the pandas data analysis library to enable spatial operations on geometric types. More specifically, it provides the GeoSeries and GeoDataFrame classes (subclasses of the pandas Series and DataFrame) to work with geospatial vector datasets.
In recent releases, pandas has focused on extensibility. It introduced the "pandas ExtensionArray interface", which allows third-party libraries to define custom data types beyond the numpy data types and to specify how they should be handled within pandas. This is a perfect fit for GeoPandas, and being able to use it in GeoPandas was one of the drivers for me to contribute to those developments in pandas. See this blog post for more about ExtensionArrays.
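To make the idea a bit more concrete, here is a minimal sketch using a type that ships with pandas itself (the nullable integer "Int64" dtype) rather than anything from GeoPandas. It only illustrates the difference between a generic object column and a column backed by an ExtensionArray; the variable names are purely illustrative.

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype

# A plain object-dtype Series: pandas knows nothing about the values it holds.
obj = pd.Series([1, None, 3], dtype=object)

# The same values stored with pandas' built-in nullable integer extension type.
ext = pd.Series([1, None, 3], dtype="Int64")

print(obj.dtype)
print(ext.dtype)

# The dtype and the backing array both implement the extension interface,
# which is how pandas knows how to handle missing values, printing, etc.
print(isinstance(ext.dtype, ExtensionDtype))
print(isinstance(ext.array, ExtensionArray))
print(ext.isna().tolist())
```

A library such as GeoPandas can plug its own dtype and array class into the same interface.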
The upcoming 0.6.0 release of GeoPandas features a refactor of the internals of GeoPandas, now finally using this new ExtensionArray interface of pandas. These are mainly code changes under the hood, without really changing how GeoPandas is used. But the refactor will enable more improvements in the future.
import geopandas
geopandas.__version__
Let's look at a few of the changes. Reading the built-in New York boroughs file:
gdf = geopandas.read_file(geopandas.datasets.get_path('nybb'))
| | BoroCode | BoroName | Shape_Leng | Shape_Area | geometry |
|---|---|---|---|---|---|
| 0 | 5 | Staten Island | 330470.010332 | 1.623820e+09 | MULTIPOLYGON (((970217.0223999023 145643.33221... |
| 1 | 4 | Queens | 896344.047763 | 3.045213e+09 | MULTIPOLYGON (((1029606.076599121 156073.81420... |
| 2 | 3 | Brooklyn | 741080.523166 | 1.937479e+09 | MULTIPOLYGON (((1021176.479003906 151374.79699... |
| 3 | 1 | Manhattan | 359299.096471 | 6.364715e+08 | MULTIPOLYGON (((981219.0557861328 188655.31579... |
| 4 | 2 | Bronx | 464392.991824 | 1.186925e+09 | MULTIPOLYGON (((1012821.805786133 229228.26458... |
Up to now, everything looks familiar. The main difference for the user is that the 'geometry' column in the above GeoDataFrame is no longer of object dtype (the "catch-all" dtype in pandas that can hold any Python object):
BoroCode        int64
BoroName       object
Shape_Leng    float64
Shape_Area    float64
geometry     geometry
dtype: object
We now see that our geometry column has a geometry data type!
Apart from being more user friendly than the generic "object", this also ensures that the values in that column are all actual geometry objects (or the missing value indicator).
The underlying array of a GeoSeries is now the custom GeometryArray (the array-like implemented in GeoPandas that follows the pandas ExtensionArray interface):
<GeometryArray>
[<shapely.geometry.multipolygon.MultiPolygon object at 0x7f340afbf8d0>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21ac8>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21b00>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af218d0>,
 <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21588>]
Length: 5, dtype: geometry
Before, this would have been a numpy array. You can still get one by explicitly converting to a numpy array:
array([<shapely.geometry.multipolygon.MultiPolygon object at 0x7f340afbf8d0>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21ac8>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21b00>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af218d0>,
       <shapely.geometry.multipolygon.MultiPolygon object at 0x7f340af21588>],
      dtype=object)
Note that this is not a "native" geometry data type. It still stores the Shapely objects in an object-dtype numpy array under the hood, but now wrapped in the GeometryArray to better integrate with pandas. Or, at least, for now. The current release does not contain many fancy new features. It mainly tries to preserve the existing GeoPandas experience, built on this new interface with pandas under the hood. But it will allow more exciting changes in the near future! See below on future performance improvements.
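As a minimal sketch of how these three views of a geometry column can be inspected, the snippet below uses a small, made-up GeoSeries of points instead of the boroughs file. It assumes GeoPandas 0.6.0 or later, together with Shapely and numpy.

```python
import numpy as np
import geopandas
from shapely.geometry import Point

# A tiny illustrative GeoSeries (not the boroughs dataset used above).
s = geopandas.GeoSeries([Point(0, 0), Point(1, 1)])

# The dtype is now reported as "geometry" instead of "object".
print(s.dtype)

# The values backing the series are a GeometryArray ...
arr = s.values
print(type(arr).__name__)

# ... which can still be converted explicitly to an object-dtype numpy array
# holding the Shapely objects.
as_numpy = np.asarray(s)
print(as_numpy.dtype)
```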
Missing geometries versus empty geometries
Not exactly related to the ExtensionArray refactor, but at the same time we tried to make the missing data handling more consistent within GeoPandas.
Historically, missing ("NA") values in a GeoSeries could be represented by empty geometric objects, in addition to standard representations such as
np.nan. At least, this was the case in
GeoSeries.isna() or when GeoSeries got aligned in geospatial operations. But, other methods like
fillna did not follow this approach and did not consider empty geometries as missing.
In the upcoming 0.6.0 release, we have changed this behaviour to be more in line with pandas and to be consistent within GeoPandas: only actual missing values are considered missing:
from shapely.geometry import Polygon
s = geopandas.GeoSeries([Polygon([(0, 0), (1, 1), (0, 1)]), None, Polygon()])
0    POLYGON ((0 0, 1 1, 0 1, 0 0))
1                              None
2          GEOMETRYCOLLECTION EMPTY
dtype: geometry
The GeoSeries.isna() method now only returns True for the missing value (the second element):
0    False
1     True
2    False
dtype: bool
If you want to know which values are empty geometries, you can use the existing GeoSeries.is_empty attribute:
0    False
1    False
2     True
dtype: bool
Or use a combination of both to get the previous behaviour of GeoSeries.isna(), detecting both missing and empty geometries:
s.is_empty | s.isna()
0    False
1     True
2     True
dtype: bool
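If existing code relied on the old behaviour, for example to filter out everything that is not a usable geometry, the combined mask can be wrapped in a small helper. This is only a sketch: the function name `missing_or_empty` is hypothetical and not part of GeoPandas.

```python
import geopandas
from shapely.geometry import Polygon


def missing_or_empty(series):
    """Hypothetical helper: True where a geometry is missing or empty."""
    return series.isna() | series.is_empty


s = geopandas.GeoSeries([Polygon([(0, 0), (1, 1), (0, 1)]), None, Polygon()])

mask = missing_or_empty(s)

# Keep only the rows that hold a real, non-empty geometry.
valid = s[~mask]
print(valid)
```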
On the surface, this doesn't change much. Your code using GeoPandas should still work as before. But under the hood quite a few changes were made, and such a refactor can always come with unintended side effects. So this is a call to try out the new version in your applications!
You can update your GeoPandas install to the release candidate with:
conda install --channel conda-forge/label/rc geopandas=0.6.0rc1
# or with pip:
pip install --pre geopandas==0.6.0rc1
If you encounter any issues, please report them at https://github.com/geopandas/geopandas/issues
Upcoming performance improvements
With this initial 0.6.0 release using the pandas ExtensionArray interface, we haven't changed much of GeoPandas' functionality yet. But it will allow us to focus on important performance improvements for the next release. I already blogged about that quite some time ago, and now that the 0.6.0 refactor is done, we can finally work towards landing this in GeoPandas itself in the near future!