GeoPandas now uses pyproj.CRS and catches up with PROJ 6

Short summary: the upcoming 0.7.0 release of GeoPandas will start using pyproj.CRS to represent the Coordinate Reference System of a GeoDataFrame. This brings along a better user interface, many changes and improvements from PROJ 6, but might also require some changes in your code (getting rid of proj4 strings).

What is PROJ?

To quote from their website, PROJ is "a generic coordinate transformation software that transforms geospatial coordinates from one coordinate reference system (CRS) to another. This includes cartographic projections as well as geodetic transformations".
PROJ is a foundational piece of the open source geospatial ecosystem providing the functionality to transform coordinates in many projects such as GDAL, QGIS, PostGIS, ... and also in GeoPandas.

Over the last few years, PROJ has seen a lot of improvements (funded through the GDAL Coordinate System Barn Raising effort): a unified CRS database (now included in PROJ), better support for WKT2 ("Well Known Text", a standardized format for describing a CRS), and more accurate transformations between CRSs with different datums. This culminated in the PROJ 6 release (see the release notes).

Following the changes in PROJ, the pyproj package (which provides Python bindings to PROJ) introduced the pyproj.CRS object to represent a Coordinate Reference System with a user-friendly interface. This has some consequences that I will try to explain below.

Back to GeoPandas: how was/is CRS information stored?

In GeoPandas, the .crs attribute stores the CRS of the GeoDataFrame, and up to now (version 0.6), this was stored as a "proj4 string" (or a dictionary representation of it).

For example, you would see things like this:

>>> gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_cities"))
>>> gdf.crs
{'init': 'epsg:4326'}

The above is the dictionary form of "+init=epsg:4326", a proj4 string using an EPSG code to describe the CRS. And a full form proj4 string could for example look like this (for the projected CRS EPSG:31370 used in Belgium):

"+proj=lcc +lat_0=90 +lon_0=4.36748666666667 +lat_1=51.1666672333333 +lat_2=49.8333339 +x_0=150000.013 +y_0=5400088.438 +ellps=intl +units=m +no_defs +type=crs"

The above is now the past: starting with GeoPandas 0.7, the CRS information will be stored as a pyproj.CRS object, which is a richer representation of a coordinate reference system.
Repeating the code sample from above, but now using the upcoming GeoPandas 0.7 in combination with pyproj 2.4, we get:

In [2]:
import geopandas
gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_cities"))
gdf.crs
Out[2]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
In [3]:
type(gdf.crs)
Out[3]:
pyproj.crs.CRS

Moving away from "proj4 strings"

With the changes in PROJ 6, the PROJ community wants to move away from using proj4 strings to represent a CRS, as we did up to now. Why? Because the proj4 string format is limited and cannot faithfully describe every CRS. Using a proj4 string (instead of e.g. a WKT string) loses valuable information about the CRS (e.g. the name, the exact datum, the area of use, etc.), possibly resulting in less precise transformations.

What should be used instead? The most recommended formats are "Well Known Text" (WKT) strings and AUTHORITY:CODE identifiers (where the authority typically is EPSG). In practice, using the EPSG code will work in many cases. For example "EPSG:4326" for geographical coordinates (WGS84) or "EPSG:3857" for projected coordinates in the Web Mercator projection.

See also: https://proj.org/faq.html#what-is-the-best-format-for-describing-coordinate-reference-systems

(Sidenote: there are still use cases for proj4 strings, such as for describing transformation pipelines in PROJ, or in cases you don't care about the specific datum, but in general not for describing a CRS).
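
As a minimal sketch of the two recommended formats, the snippet below builds a CRS from an AUTHORITY:CODE identifier with pyproj, exports it to WKT, and reads that WKT back in. The variable names are only illustrative, and the example assumes a pyproj version based on PROJ 6 or later.

```python
import pyproj

# Build a CRS from an AUTHORITY:CODE identifier.
crs = pyproj.CRS("EPSG:31370")

# Export to a WKT string and parse it again. Unlike a proj4 string,
# the WKT form carries the CRS name and its EPSG identifier.
wkt = crs.to_wkt()
roundtrip = pyproj.CRS.from_wkt(wkt)

print(roundtrip.name)
print(roundtrip.to_epsg())
```

The round-tripped object should still report the same name and EPSG code as the original, which is the kind of information a proj4 string cannot preserve.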

The new pyproj.CRS class

As shown above, the .crs attribute now returns a pyproj.CRS. And you can already see that the representation of this object is much more informative than the proj4 string before (it includes its name, whether it is geographic or projected, the area of use, the datum, ...):

In [4]:
gdf.crs
Out[4]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

This object now provides a much richer and more user-friendly interface compared to the old proj4 strings/dicts. Apart from the repr, the above information is also available through attributes or methods:

In [5]:
gdf.crs.name
Out[5]:
'WGS 84'
In [6]:
gdf.crs.datum
Out[6]:
DATUM["World Geodetic System 1984",
    ELLIPSOID["WGS 84",6378137,298.257223563,
        LENGTHUNIT["metre",1]],
    ID["EPSG",6326]]

Or for a projected CRS:

In [7]:
import pyproj
crs = pyproj.CRS("EPSG:31370")
crs
Out[7]:
<Projected CRS: EPSG:31370>
Name: Belge 1972 / Belgian Lambert 72
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: Belgium - onshore
- bounds: (2.5, 49.5, 6.4, 51.51)
Coordinate Operation:
- name: Belgian Lambert 72
- method: Lambert Conic Conformal (2SP)
Datum: Reseau National Belge 1972
- Ellipsoid: International 1924
- Prime Meridian: Greenwich
In [8]:
crs.is_geographic
Out[8]:
False
In [9]:
crs.is_projected
Out[9]:
True

Will this change break my code?

This is a big change for GeoPandas. It is required to keep up with the PROJ community, but I think it is also a nice change that improves usability. Unfortunately, it will also require some transition work, depending on your workflow.

When reading geospatial files with geopandas.read_file, things should mostly work out of the box. But if you specify the CRS manually in your code, a first clear change is needed. Currently, a lot of people (and the GeoPandas docs did so too) specify the EPSG code using the "init" proj4 string:

## OLD
GeoDataFrame(..., crs={'init': 'epsg:4326'})
# or
gdf.crs = {'init': 'epsg:4326'}
# or
gdf.to_crs({'init': 'epsg:4326'})

The above will now raise a deprecation warning from pyproj, and instead of the "init" proj4 string, you should use only the EPSG code itself as follows:

## NEW
GeoDataFrame(..., crs="EPSG:4326")
# or
gdf.crs = "EPSG:4326"
# or
gdf.to_crs("EPSG:4326")

See the pyproj docs for more on this. If you used a full proj4 string, it is also recommended to replace it with an EPSG code if possible.
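
If old-style "init" specifications are scattered through a code base, a small conversion helper can ease the transition. The function below is a hypothetical sketch (init_to_authority_code is not part of GeoPandas or pyproj): it rewrites the "init" dict or string form into an "AUTHORITY:CODE" string and returns anything else unchanged.

```python
def init_to_authority_code(crs):
    """Convert an old-style "init" CRS spec to an "AUTHORITY:CODE" string.

    Hypothetical helper, not part of GeoPandas or pyproj. Input that is
    not recognised as a plain "init" spec is returned unchanged.
    """
    if isinstance(crs, dict) and set(crs) == {"init"}:
        value = crs["init"]
    elif isinstance(crs, str) and crs.strip().startswith("+init="):
        value = crs.strip()[len("+init="):]
        # A proj4 string with extra parameters is left untouched.
        if " " in value:
            return crs
    else:
        return crs

    authority, _, code = value.partition(":")
    if not code:
        return crs
    return f"{authority.upper()}:{code}"


print(init_to_authority_code({"init": "epsg:4326"}))
print(init_to_authority_code("+init=epsg:31370"))
```

The result can then be passed wherever a CRS is accepted, for example gdf.to_crs(init_to_authority_code(old_crs)).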

One actual breaking change is that the returned value from crs is no longer a string or dict. So if you relied on this aspect, an update will be needed. For example, I have seen this code in the wild to get the EPSG code:

gdf.crs['init']
# or
'init' in gdf.crs

This will no longer work. To get the EPSG code from a crs object, you can use the to_epsg() method. And there are many other methods available on the CRS class to get information about the CRS.
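
As a rough illustration of that replacement, the sketch below uses a standalone pyproj.CRS object in place of gdf.crs. Note that to_epsg() returns an integer, or None when no matching EPSG code can be identified, so code that previously checked 'init' in gdf.crs may want to check for None instead.

```python
import pyproj

# Stand-in for gdf.crs, which is a pyproj.CRS object in GeoPandas 0.7.
crs = pyproj.CRS("EPSG:4326")

# to_epsg() returns an int, or None if no EPSG code matches this CRS.
epsg = crs.to_epsg()

if epsg is not None:
    print(f"EPSG code: {epsg}")
else:
    print("No EPSG code could be identified for this CRS")
```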

There are probably other (unforeseen) cases that might require updating your code. If you encounter any problems when upgrading to GeoPandas 0.7, please provide feedback on GitHub! That way we can try to smooth this migration by ironing out issues or improving the documentation on how to upgrade.

Thanks to the PROJ and pyproj communities!

This is an important change for GeoPandas, providing better and more user-friendly handling of Coordinate Reference Systems. And all that is only possible thanks to the PROJ and pyproj projects (and special thanks to Even Rouault for a lot of the PROJ work, and to Alan Snow for his work on pyproj and integrating this in GeoPandas).
