GeoPandas now uses the pandas ExtensionArray interface
Short summary: the upcoming 0.6.0 release of GeoPandas will feature a refactor based on the pandas ExtensionArray interface. Although this change should keep the user interface mostly stable, it enables more robust integration with pandas and paves the way for further changes. And given the invasive code changes under the hood, testing is very welcome!
GeoPandas extends the pandas data analysis library to enable spatial operations on geometric types. More specifically, it provides the GeoSeries and GeoDataFrame classes (subclasses of the pandas Series and DataFrame) to work with geospatial vector datasets.
In its recent releases, pandas has focused on extensibility. It introduced the "pandas ExtensionArray interface", which allows third-party libraries to specify custom data types that extend numpy data types and how they should be handled within pandas. This is a perfect fit for GeoPandas, and being able to use it in GeoPandas was one of my motivations for contributing to those developments in pandas. See this blog post for more about ExtensionArrays.
The upcoming 0.6.0 release of GeoPandas features a refactor of the internals of GeoPandas, now finally using this new ExtensionArray interface of pandas. These are mainly code changes under the hood, without really changing how GeoPandas gets used. But the refactor will enable more improvements in the future.
import geopandas
geopandas.__version__
A geometry dtype
Let's look at a few of the changes. Reading the built-in New York boroughs file:
gdf = geopandas.read_file(geopandas.datasets.get_path('nybb'))
gdf
So far, everything looks familiar. The main difference for the user is that the 'geometry' column in the above GeoDataFrame is no longer of object dtype (the "catch-all" dtype in pandas that can hold any Python object):
gdf.dtypes
We now see that our geometry column has a geometry data type!
Apart from being more user friendly than the generic "object", this also ensures that the values in that column are all actual geometry objects (or the missing value indicator).
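As a rough illustration of that guarantee, the sketch below builds a GeometryArray directly. It assumes the from_shapely helper in geopandas.array (not mentioned above) and that it raises a TypeError when it meets a value that is neither a geometry nor a missing value; the sample values are made up for the example.

```python
from geopandas.array import from_shapely
from shapely.geometry import Point

# Geometries and missing values (None) are accepted.
valid = from_shapely([Point(0, 0), None])

# Anything else is expected to be rejected (assumption: a TypeError is raised).
rejected = False
try:
    from_shapely([Point(0, 0), "not a geometry"])
except TypeError:
    rejected = True
```

With a plain object-dtype column, the string would simply have been stored next to the geometries.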
The underlying values of a GeoSeries are now stored in the custom GeometryArray (the array-like implemented in GeoPandas that follows the pandas ExtensionArray interface):
gdf.geometry.values
type(gdf.geometry.values)
Before, this would have been a numpy array. You can still get one by explicitly converting to a numpy array:
import numpy as np
np.asarray(gdf.geometry)
Note that this is not a "native" geometry data type. It still stores the Shapely objects in an object-dtype numpy array under the hood, but now wrapped in the GeometryArray to better integrate with pandas. Or, at least, for now. The current release does not contain many fancy new features. It mainly tries to give the existing GeoPandas experience, based on this new interface with pandas under the hood. But it will allow more exciting changes in the near future! See below on future performance improvements.
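The points above can be checked on a small, self-contained GeoSeries that does not depend on the sample dataset. This is only a sketch; the two points and the variable names are made up for the example.

```python
import numpy as np
import geopandas
from geopandas.array import GeometryArray
from shapely.geometry import Point
from shapely.geometry.base import BaseGeometry

s = geopandas.GeoSeries([Point(0, 0), Point(1, 1)])

# The Series reports the dedicated dtype instead of "object".
dtype_name = s.dtype.name

# .values gives back the ExtensionArray wrapper, not a numpy array.
is_geometry_array = isinstance(s.values, GeometryArray)

# Converting explicitly yields an object-dtype numpy array of Shapely objects.
as_numpy = np.asarray(s)
holds_shapely = all(isinstance(geom, BaseGeometry) for geom in as_numpy)
```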
Missing geometries versus empty geometries
Not exactly related to the ExtensionArray refactor, but at the same time we tried to make the missing data handling more consistent within GeoPandas.
Historically, missing ("NA") values in a GeoSeries could be represented by empty geometric objects, in addition to standard representations such as None and np.nan. At least, this was the case in GeoSeries.isna() or when GeoSeries objects were aligned in geospatial operations. Other methods such as dropna and fillna, however, did not follow this approach and did not consider empty geometries as missing.
In the upcoming 0.6.0 release, we have changed this behaviour to be more in line with pandas and to be consistent within GeoPandas: only actual missing values are considered missing:
from shapely.geometry import Polygon
s = geopandas.GeoSeries([Polygon([(0, 0), (1, 1), (0, 1)]), None, Polygon([])])
s
The GeoSeries.isna() method now only returns True for the missing value (the second element):
s.isna()
If you want to know which values are empty geometries, you can use the existing GeoSeries.is_empty:
s.is_empty
Or combine both to get the previous behaviour of GeoSeries.isna() and detect both missing and empty geometries:
s.is_empty | s.isna()
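The same distinction matters when filtering. The sketch below rebuilds the example series (using Polygon() for the empty geometry) and compares dropna, which only removes the missing value, with the combined mask, which also removes the empty polygon. The variable names are illustrative only.

```python
import geopandas
from shapely.geometry import Polygon

s = geopandas.GeoSeries([Polygon([(0, 0), (1, 1), (0, 1)]), None, Polygon()])

# dropna() removes only the actual missing value (the None at position 1).
dropped = s.dropna()

# Filtering with the combined mask also removes the empty polygon.
cleaned = s[~(s.is_empty | s.isna())]
```

Here dropped keeps the rows labelled 0 and 2, while cleaned keeps only row 0.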
Please test!
On the surface, this doesn't change much. Your code using GeoPandas should still work the way it did before. But under the hood, quite a lot has changed, and such a refactor can always come with unintended side effects. So this is a call for trying out this new version in your applications!
You can update your GeoPandas install to the release candidate with:
conda install --channel conda-forge/label/rc geopandas=0.6.0rc1
# or with pip:
pip install --pre geopandas==0.6.0rc1
If you encounter any issues, please report them at https://github.com/geopandas/geopandas/issues
Upcoming performance improvements
With this initial 0.6.0 release using the pandas ExtensionArray interface, we haven't changed much of the functionality of GeoPandas yet. But it will allow us to focus on important performance improvements for the next release. I already blogged about that quite some time ago, and now that the 0.6.0 refactor is done, we can finally work towards landing this in GeoPandas itself in the near future!