This is a major release from 0.16.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Warning
pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (GH9118)
The pandas.io.data package is deprecated and will be replaced by the pandas-datareader package. This will allow the data modules to be updated independently of your pandas installation. The API for pandas-datareader v0.1.1 is exactly the same as in pandas v0.17.0 (GH8961, GH10861).
After installing pandas-datareader, you can easily change your imports:
from pandas.io import data, wb
becomes
from pandas_datareader import data, wb
Highlights include:
Release the Global Interpreter Lock (GIL) on some cython operations, see here
Plotting methods are now available as attributes of the .plot accessor, see here
The sorting API has been revamped to remove some long-time inconsistencies, see here
Support for a datetime64[ns] with timezones as a first-class dtype, see here
The default for to_datetime will now be to raise when presented with unparseable formats, previously this would return the original input. Also, date parse functions now return consistent results. See here
The default for dropna in HDFStore has changed to False, to store by default all rows even if they are all NaN, see here
Datetime accessor (dt) now supports Series.dt.strftime to generate formatted strings for datetime-likes, and Series.dt.total_seconds to compute the duration of each timedelta in seconds. See here
Period and PeriodIndex can handle a multiplied freq like 3D, which corresponds to a span of 3 days. See here
Development installed versions of pandas will now have PEP440 compliant version strings (GH9518)
Development support for benchmarking with the Air Speed Velocity library (GH8361)
Support for reading SAS xport files, see here
Documentation comparing SAS to pandas, see here
Removal of the automatic TimeSeries broadcasting, deprecated since 0.8.0, see here
Display format with plain text can optionally align with Unicode East Asian Width, see here
Compatibility with Python 3.5 (GH11097)
Compatibility with matplotlib 1.5.0 (GH11111)
Check the API Changes and deprecations before updating.
What’s new in v0.17.0
New features
Datetime with TZ
Releasing the GIL
Plot submethods
Additional methods for dt accessor
strftime
total_seconds
Period frequency enhancement
Support for SAS XPORT files
Support for math functions in .eval()
Changes to Excel with MultiIndex
Google BigQuery enhancements
Display alignment with Unicode East Asian width
Other enhancements
Backwards incompatible API changes
Changes to sorting API
Changes to to_datetime and to_timedelta
Error handling
Consistent parsing
Changes to Index comparisons
Changes to boolean comparisons vs. None
HDFStore dropna behavior
Changes to display.precision option
Changes to Categorical.unique
Changes to bool passed as header in parsers
Other API changes
Deprecations
Removal of prior version deprecations/changes
Performance improvements
Bug fixes
Contributors
We are adding an implementation that natively supports datetime with timezones. A Series or a DataFrame column previously could be assigned a datetime with timezones, and would work as an object dtype. This had performance issues with a large number of rows. See the docs for more details. (GH8260, GH10763, GH11034).
The new implementation allows for having a single timezone across all rows, with operations performed in a performant manner.
In [1]: df = pd.DataFrame({'A': pd.date_range('20130101', periods=3),
   ...:                    'B': pd.date_range('20130101', periods=3, tz='US/Eastern'),
   ...:                    'C': pd.date_range('20130101', periods=3, tz='CET')})
   ...:

In [2]: df
Out[2]:
           A                         B                         C
0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:00
1 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:00
2 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00

[3 rows x 3 columns]

In [3]: df.dtypes
Out[3]:
A                datetime64[ns]
B    datetime64[ns, US/Eastern]
C           datetime64[ns, CET]
Length: 3, dtype: object
In [4]: df.B
Out[4]:
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Name: B, Length: 3, dtype: datetime64[ns, US/Eastern]

In [5]: df.B.dt.tz_localize(None)
Out[5]:
0   2013-01-01
1   2013-01-02
2   2013-01-03
Name: B, Length: 3, dtype: datetime64[ns]
This uses a new dtype representation as well, which is very similar in look-and-feel to its NumPy cousin datetime64[ns].
In [6]: df['B'].dtype
Out[6]: datetime64[ns, US/Eastern]

In [7]: type(df['B'].dtype)
Out[7]: pandas.core.dtypes.dtypes.DatetimeTZDtype
Note
There is a slightly different string repr for the underlying DatetimeIndex as a result of the dtype changes, but functionally these are the same.
Previous behavior:
In [1]: pd.date_range('20130101', periods=3, tz='US/Eastern')
Out[1]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
                       '2013-01-03 00:00:00-05:00'],
                      dtype='datetime64[ns]', freq='D', tz='US/Eastern')

In [2]: pd.date_range('20130101', periods=3, tz='US/Eastern').dtype
Out[2]: dtype('<M8[ns]')
New behavior:
In [8]: pd.date_range('20130101', periods=3, tz='US/Eastern')
Out[8]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
                       '2013-01-03 00:00:00-05:00'],
                      dtype='datetime64[ns, US/Eastern]', freq='D')

In [9]: pd.date_range('20130101', periods=3, tz='US/Eastern').dtype
Out[9]: datetime64[ns, US/Eastern]
We are releasing the global-interpreter-lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this. (GH8882)
For example, the groupby expression in the following code will have the GIL released during the factorization step, e.g. df.groupby('key'), as well as during the .sum() operation.
N = 1000000
ngroups = 10
df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N),
                   'data': np.random.randn(N)})
df.groupby('key')['data'].sum()
Releasing the GIL could benefit an application that uses threads for user interactions (e.g. Qt), or that performs multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask library.
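As an illustration of the kind of pattern this enables, here is a minimal sketch (not from the original release notes) that runs several of these aggregations concurrently with the standard library's threading module; because the factorization and summation release the GIL, the threads can genuinely overlap:

import threading

import numpy as np
import pandas as pd

N = 1000000
ngroups = 10
df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N),
                   'data': np.random.randn(N)})

results = {}

def agg(name):
    # groupby/sum release the GIL internally, so these calls can overlap
    results[name] = df.groupby('key')['data'].sum()

threads = [threading.Thread(target=agg, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()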
The Series and DataFrame .plot() method allows for customizing plot types by supplying the kind keyword argument. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.
To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the .plot attribute. Instead of writing series.plot(kind=<kind>, ...), you can now also use series.plot.<kind>(...):
In [10]: df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])

In [11]: df.plot.bar()
As a result of this change, these methods are now all discoverable via tab-completion:
In [12]: df.plot.<TAB>  # noqa: E225, E999
df.plot.area     df.plot.barh     df.plot.density  df.plot.hist     df.plot.line     df.plot.scatter
df.plot.bar      df.plot.box      df.plot.hexbin   df.plot.kde      df.plot.pie
Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the new Plotting API documentation.
We are now supporting a Series.dt.strftime method for datetime-likes to generate a formatted string (GH10110). Examples:
# DatetimeIndex
In [13]: s = pd.Series(pd.date_range('20130101', periods=4))

In [14]: s
Out[14]:
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
Length: 4, dtype: datetime64[ns]

In [15]: s.dt.strftime('%Y/%m/%d')
Out[15]:
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
Length: 4, dtype: object
# PeriodIndex
In [16]: s = pd.Series(pd.period_range('20130101', periods=4))

In [17]: s
Out[17]:
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
Length: 4, dtype: period[D]

In [18]: s.dt.strftime('%Y/%m/%d')
Out[18]:
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
Length: 4, dtype: object
The string format follows the Python standard library; details can be found here
pd.Series of type timedelta64 has a new method .dt.total_seconds() returning the duration of the timedelta in seconds (GH10817)
# TimedeltaIndex
In [19]: s = pd.Series(pd.timedelta_range('1 minutes', periods=4))

In [20]: s
Out[20]:
0   0 days 00:01:00
1   1 days 00:01:00
2   2 days 00:01:00
3   3 days 00:01:00
Length: 4, dtype: timedelta64[ns]

In [21]: s.dt.total_seconds()
Out[21]:
0        60.0
1     86460.0
2    172860.0
3    259260.0
Length: 4, dtype: float64
Period, PeriodIndex and period_range can now accept multiplied freq. Also, Period.freq and PeriodIndex.freq are now stored as a DateOffset instance like DatetimeIndex, and not as str (GH7811)
A multiplied freq represents a span of corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.
In [22]: p = pd.Period('2015-08-01', freq='3D')

In [23]: p
Out[23]: Period('2015-08-01', '3D')

In [24]: p + 1
Out[24]: Period('2015-08-04', '3D')

In [25]: p - 2
Out[25]: Period('2015-07-26', '3D')

In [26]: p.to_timestamp()
Out[26]: Timestamp('2015-08-01 00:00:00')

In [27]: p.to_timestamp(how='E')
Out[27]: Timestamp('2015-08-03 23:59:59.999999999')
You can use the multiplied freq in PeriodIndex and period_range.
In [28]: idx = pd.period_range('2015-08-01', periods=4, freq='2D')

In [29]: idx
Out[29]: PeriodIndex(['2015-08-01', '2015-08-03', '2015-08-05', '2015-08-07'],
                     dtype='period[2D]', freq='2D')

In [30]: idx + 1
Out[30]: PeriodIndex(['2015-08-03', '2015-08-05', '2015-08-07', '2015-08-09'],
                     dtype='period[2D]', freq='2D')
read_sas() provides support for reading SAS XPORT format files. (GH4052).
df = pd.read_sas('sas_xport.xpt')
It is also possible to obtain an iterator and read an XPORT file incrementally.
for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
    do_something(df)
See the docs for more details.
eval() now supports calling math functions (GH4893)
df = pd.DataFrame({'a': np.random.randn(10)})
df.eval("b = sin(a)")
The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
These functions map to the intrinsics for the NumExpr engine. For the Python engine, they are mapped to NumPy calls.
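As a small sketch of this engine mapping (assuming the optional numexpr package is installed for the default engine):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(10)})

df.eval("b = sin(a)")                   # NumExpr intrinsics when available
df.eval("c = sin(a)", engine='python')  # falls back to the NumPy functions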
In version 0.16.2 a DataFrame with MultiIndex columns could not be written to Excel via to_excel. That functionality has been added (GH10564), along with updating read_excel so that the data can be read back with no loss of information, by specifying which columns/rows make up the MultiIndex in the header and index_col parameters (GH4679)
See the documentation for more details.
In [31]: df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
   ....:                   columns=pd.MultiIndex.from_product(
   ....:                       [['foo', 'bar'], ['a', 'b']], names=['col1', 'col2']),
   ....:                   index=pd.MultiIndex.from_product([['j'], ['l', 'k']],
   ....:                                                    names=['i1', 'i2']))
   ....:

In [32]: df
Out[32]:
col1  foo    bar
col2    a  b   a  b
i1 i2
j  l    1  2   3  4
   k    5  6   7  8

[2 rows x 4 columns]

In [33]: df.to_excel('test.xlsx')

In [34]: df = pd.read_excel('test.xlsx', header=[0, 1], index_col=[0, 1])

In [35]: df
Out[35]:
col1  foo    bar
col2    a  b   a  b
i1 i2
j  l    1  2   3  4
   k    5  6   7  8

[2 rows x 4 columns]
Previously, it was necessary to specify the has_index_names argument in read_excel, if the serialized data had index names. For version 0.17.0 the output format of to_excel has been changed to make this keyword unnecessary - the change is shown below.
Old and New: screenshots in the original documentation compare the Excel output layout before and after this change.
Excel files saved in version 0.16.2 or prior that had index names will still be able to be read in, but the has_index_names argument must be specified as True.
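For example, reading back such an older file might look like the following (a sketch; the file name is a placeholder):

# 'old_format.xlsx' is a hypothetical file written by pandas <= 0.16.2
df = pd.read_excel('old_format.xlsx', header=[0, 1], index_col=[0, 1],
                   has_index_names=True)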
Added ability to automatically create a table/dataset using the pandas.io.gbq.to_gbq() function if the destination table/dataset does not exist. (GH8325, GH11121).
Added ability to replace an existing table and schema when calling the pandas.io.gbq.to_gbq() function via the if_exists argument. See the docs for more details (GH8325).
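A minimal sketch of these two enhancements together (the dataset, table, and project id below are placeholders):

from pandas.io import gbq

# creates 'my_dataset.my_table' if it does not exist, and replaces
# the existing table and schema if it does
gbq.to_gbq(df, 'my_dataset.my_table', project_id='my-project',
           if_exists='replace')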
InvalidColumnOrder and InvalidPageToken in the gbq module will raise ValueError instead of IOError.
The generate_bq_schema() function is now deprecated and will be removed in a future version (GH11121)
The gbq module will now support Python 3 (GH11094).
Warning
Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower). Use only when it is actually required.
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters. If a DataFrame or Series contains these characters, the default output cannot be aligned properly. The following options are added to enable precise handling of these characters.
display.unicode.east_asian_width: Whether to use the Unicode East Asian Width to calculate the display text width. (GH2612)
display.unicode.ambiguous_as_wide: Whether to handle Unicode characters classified as Ambiguous as Wide. (GH11102)
In [36]: df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']})

In [37]: df;
In [38]: pd.set_option('display.unicode.east_asian_width', True)

In [39]: df;
For further details, see here
Support for openpyxl >= 2.2. The API for style support is now stable (GH10125)
merge now accepts the argument indicator which adds a Categorical-type column (by default called _merge) to the output object that takes on the following values (GH8790):
Observation Origin                 _merge value
Merge key only in 'left' frame     left_only
Merge key only in 'right' frame    right_only
Merge key in both frames           both
In [40]: df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})

In [41]: df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

In [42]: pd.merge(df1, df2, on='col1', how='outer', indicator=True)
Out[42]:
   col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only

[4 rows x 4 columns]
For more, see the updated docs
pd.to_numeric is a new function to coerce strings to numbers (with optional coercion of invalid values) (GH11133)
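For example (a small sketch; with errors='coerce', unparseable values become NaN):

s = pd.Series(['1.0', '2', -3, 'apple'])

pd.to_numeric(s, errors='coerce')
# 0    1.0
# 1    2.0
# 2   -3.0
# 3    NaN
# dtype: float64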
pd.merge will now allow duplicate column names if they are not merged upon (GH10639).
pd.pivot will now allow passing index as None (GH3962).
pd.concat will now use existing Series names if provided (GH10698).
In [43]: foo = pd.Series([1, 2], name='foo')

In [44]: bar = pd.Series([1, 2])

In [45]: baz = pd.Series([4, 5])
Previous behavior:

In [1]: pd.concat([foo, bar, baz], 1)
Out[1]:
   0  1  2
0  1  1  4
1  2  2  5
New behavior:

In [46]: pd.concat([foo, bar, baz], 1)
Out[46]:
   foo  0  1
0    1  1  4
1    2  2  5

[2 rows x 3 columns]
DataFrame has gained the nlargest and nsmallest methods (GH10393)
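For example (a small sketch):

df = pd.DataFrame({'a': [3, 1, 4, 1, 5], 'b': list('abcde')})

df.nlargest(2, 'a')    # the two rows with the largest values in 'a'
df.nsmallest(2, 'a')   # the two rows with the smallest values in 'a'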
Add a limit_direction keyword argument that works with limit to enable interpolate to fill NaN values forward, backward, or both (GH9218, GH10420, GH11115)
In [47]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])

In [48]: ser.interpolate(limit=1, limit_direction='both')
Out[48]:
0     NaN
1     5.0
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
Length: 7, dtype: float64
Added a DataFrame.round method to round the values to a variable number of decimal places (GH10568).
In [49]: df = pd.DataFrame(np.random.random([3, 3]),
   ....:                   columns=['A', 'B', 'C'],
   ....:                   index=['first', 'second', 'third'])
   ....:

In [50]: df
Out[50]:
               A         B         C
first   0.126970  0.966718  0.260476
second  0.897237  0.376750  0.336222
third   0.451376  0.840255  0.123102

[3 rows x 3 columns]

In [51]: df.round(2)
Out[51]:
           A     B     C
first   0.13  0.97  0.26
second  0.90  0.38  0.34
third   0.45  0.84  0.12

[3 rows x 3 columns]

In [52]: df.round({'A': 0, 'C': 2})
Out[52]:
          A         B     C
first   0.0  0.966718  0.26
second  1.0  0.376750  0.34
third   0.0  0.840255  0.12

[3 rows x 3 columns]
drop_duplicates and duplicated now accept a keep keyword to target first, last, and all duplicates. The take_last keyword is deprecated, see here (GH6511, GH8505)
In [53]: s = pd.Series(['A', 'B', 'C', 'A', 'B', 'D'])

In [54]: s.drop_duplicates()
Out[54]:
0    A
1    B
2    C
5    D
Length: 4, dtype: object

In [55]: s.drop_duplicates(keep='last')
Out[55]:
2    C
3    A
4    B
5    D
Length: 4, dtype: object

In [56]: s.drop_duplicates(keep=False)
Out[56]:
2    C
5    D
Length: 2, dtype: object
Reindex now has a tolerance argument that allows for finer control of limits on filling while reindexing (GH10411):
In [57]: df = pd.DataFrame({'x': range(5),
   ....:                    't': pd.date_range('2000-01-01', periods=5)})
   ....:

In [58]: df.reindex([0.1, 1.9, 3.5],
   ....:            method='nearest',
   ....:            tolerance=0.2)
   ....:
Out[58]:
       x          t
0.1  0.0 2000-01-01
1.9  2.0 2000-01-03
3.5  NaN        NaT

[3 rows x 2 columns]
When used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with a string:
In [59]: df = df.set_index('t')

In [60]: df.reindex(pd.to_datetime(['1999-12-31']),
   ....:            method='nearest',
   ....:            tolerance='1 day')
   ....:
Out[60]:
            x
1999-12-31  0

[1 rows x 1 columns]
tolerance is also exposed by the lower level Index.get_indexer and Index.get_loc methods.
Added functionality to use the base argument when resampling a TimedeltaIndex (GH10530)
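A minimal sketch, using the pre-0.18 resample signature (the bin offset via base is the new part here):

idx = pd.timedelta_range('0 s', periods=6, freq='s')
s = pd.Series(range(6), index=idx)

# 2-second bins shifted by 1 second: [-1s, 1s), [1s, 3s), ...
s.resample('2s', how='sum', base=1)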
DatetimeIndex can be instantiated using strings containing NaT (GH7599)
to_datetime can now accept the yearfirst keyword (GH7599)
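Small examples of the two parsing enhancements above (a sketch; yearfirst follows dateutil's interpretation):

pd.DatetimeIndex(['2013-01-01', 'NaT', '2013-01-03'])

pd.to_datetime(['10-11-12'], yearfirst=True)  # parsed as 2010-11-12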
pandas.tseries.offsets larger than the Day offset can now be used with a Series for addition/subtraction (GH10699). See the docs for more details.
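For example (a minimal sketch):

import pandas.tseries.offsets as offsets

s = pd.Series(pd.date_range('2015-01-15', periods=3, freq='D'))

s + offsets.MonthEnd()    # roll each timestamp forward to the end of the month
s - offsets.MonthBegin()  # roll each timestamp back to the start of the month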
pd.Timedelta.total_seconds() now returns the Timedelta duration to nanosecond precision (previously microsecond precision) (GH10939)
PeriodIndex now supports arithmetic with np.ndarray (GH10638)
Support pickling of Period objects (GH10439)
.as_blocks will now take an optional copy argument to return a copy of the data; the default is to copy (no change in behavior from prior versions) (GH9607)
The regex argument to DataFrame.filter now handles numeric column names instead of raising ValueError (GH10384).
Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (GH8685)
Enable writing Excel files in memory using StringIO/BytesIO (GH7074)
Enable serialization of lists and dicts to strings in ExcelWriter (GH8188)
SQL io functions now accept a SQLAlchemy connectable. (GH7877)
pd.read_sql and to_sql can accept database URI as con parameter (GH10214)
read_sql_table will now allow reading from views (GH10750).
Enable writing complex values to HDFStores when using the table format (GH10447)
Enable pd.read_hdf to be used without specifying a key when the HDF file contains a single dataset (GH10443)
pd.read_stata will now read Stata 118 type files. (GH9882)
msgpack submodule has been updated to 0.4.6 with backward compatibility (GH10581)
DataFrame.to_dict now accepts orient='index' keyword argument (GH10844).
DataFrame.apply will return a Series of dicts if the passed function returns a dict and reduce=True (GH8735).
Allow passing kwargs to the interpolation methods (GH10378).
Improved error message when concatenating an empty iterable of DataFrame objects (GH9157)
pd.read_csv can now read bz2-compressed files incrementally, and the C parser can read bz2-compressed files from AWS S3 (GH11070, GH11072).
In pd.read_csv, recognize s3n:// and s3a:// URLs as designating S3 file storage (GH11070, GH11071).
Read CSV files from AWS S3 incrementally, instead of first downloading the entire file. (Full file download still required for compressed files in Python 2.) (GH11070, GH11073)
pd.read_csv is now able to infer compression type for files read from AWS S3 storage (GH11070, GH11074).
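Together, these make patterns like the following possible (a sketch; the bucket and key are placeholders, and chunked reads of compressed S3 files require Python 3):

# stream a bz2-compressed CSV from S3 in chunks rather than
# downloading the whole file first
for chunk in pd.read_csv('s3n://my-bucket/logs.csv.bz2', chunksize=100000):
    do_something(chunk)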
The sorting API has had some longtime inconsistencies. (GH9816, GH8239).
Here is a summary of the API PRIOR to 0.17.0:
Series.sort is INPLACE while DataFrame.sort returns a new object.
Series.order returns a new object
It was possible to use Series/DataFrame.sort_index to sort by values by passing the by keyword.
Series/DataFrame.sortlevel worked only on a MultiIndex for sorting by index.
To address these issues, we have revamped the API:
We have introduced a new method, DataFrame.sort_values(), which is the merger of DataFrame.sort(), Series.sort(), and Series.order(), to handle sorting of values.
The existing methods Series.sort(), Series.order(), and DataFrame.sort() have been deprecated and will be removed in a future version.
The by argument of DataFrame.sort_index() has been deprecated and will be removed in a future version.
The existing method .sort_index() will gain the level keyword to enable level sorting.
We now have two distinct and non-overlapping methods of sorting. A * marks items that will show a FutureWarning.
To sort by the values:

Previous                        Replacement
* Series.order()                Series.sort_values()
* Series.sort()                 Series.sort_values(inplace=True)
* DataFrame.sort(columns=...)   DataFrame.sort_values(by=...)
To sort by the index:

Previous                         Replacement
Series.sort_index()              Series.sort_index()
Series.sortlevel(level=...)      Series.sort_index(level=...)
DataFrame.sort_index()           DataFrame.sort_index()
DataFrame.sortlevel(level=...)   DataFrame.sort_index(level=...)
* DataFrame.sort()               DataFrame.sort_index()
We have also deprecated and changed similar methods in two Series-like classes, Index and Categorical.
Previous                Replacement
* Index.order()         Index.sort_values()
* Categorical.order()   Categorical.sort_values()
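A quick sketch of the replacements in practice:

ser = pd.Series([2, 1, 3])
ser.sort_values()          # replaces ser.order() / ser.sort()

df = pd.DataFrame({'A': [2, 1, 3]})
df.sort_values(by='A')     # replaces df.sort(columns='A')
df.sort_index()            # sorting by the index is unchanged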
The default for pd.to_datetime error handling has changed to errors='raise'. In prior versions it was errors='ignore'. Furthermore, the coerce argument has been deprecated in favor of errors='coerce'. This means that invalid parsing will raise rather than return the original input as in previous versions. (GH10636)
Previous behavior:

In [2]: pd.to_datetime(['2009-07-31', 'asd'])
Out[2]: array(['2009-07-31', 'asd'], dtype=object)
New behavior:

In [3]: pd.to_datetime(['2009-07-31', 'asd'])
ValueError: Unknown string format
Of course you can coerce this as well.
In [61]: pd.to_datetime(['2009-07-31', 'asd'], errors='coerce')
Out[61]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)
To keep the previous behavior, you can use errors='ignore':
In [62]: pd.to_datetime(['2009-07-31', 'asd'], errors='ignore')
Out[62]: Index(['2009-07-31', 'asd'], dtype='object')
Furthermore, pd.to_timedelta has gained a similar API of errors='raise'|'ignore'|'coerce', and the coerce keyword has been deprecated in favor of errors='coerce'.
The string parsing of to_datetime, Timestamp and DatetimeIndex has been made consistent. (GH7599)
Prior to v0.17.0, Timestamp and to_datetime could parse a year-only datetime string incorrectly using today's date, whereas DatetimeIndex used the beginning of the year. Timestamp and to_datetime could also raise ValueError on some types of datetime strings which DatetimeIndex can parse, such as quarterly strings.
Previous behavior:

In [1]: pd.Timestamp('2012Q2')
Traceback
   ...
ValueError: Unable to parse 2012Q2

# Results in today's date.
In [2]: pd.Timestamp('2014')
Out[2]: 2014-08-12 00:00:00
v0.17.0 can parse them as below. It works on DatetimeIndex also.
In [63]: pd.Timestamp('2012Q2')
Out[63]: Timestamp('2012-04-01 00:00:00')

In [64]: pd.Timestamp('2014')
Out[64]: Timestamp('2014-01-01 00:00:00')

In [65]: pd.DatetimeIndex(['2012Q2', '2014'])
Out[65]: DatetimeIndex(['2012-04-01', '2014-01-01'], dtype='datetime64[ns]', freq=None)
If you want to perform calculations based on today’s date, use Timestamp.now() and pandas.tseries.offsets.
In [66]: import pandas.tseries.offsets as offsets

In [67]: pd.Timestamp.now()
Out[67]: Timestamp('2020-01-20 11:41:48.607119')

In [68]: pd.Timestamp.now() + offsets.DateOffset(years=1)
Out[68]: Timestamp('2021-01-20 11:41:48.607968')
Operator equal on Index should behave similarly to Series (GH9947, GH10637)
Starting in v0.17.0, comparing Index objects of different lengths will raise a ValueError. This is to be consistent with the behavior of Series.
Previous behavior:

In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[2]: array([ True, False, False], dtype=bool)

In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
Out[3]: array([False, True, False], dtype=bool)

In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
Out[4]: False
New behavior:

In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[8]: array([ True, False, False], dtype=bool)

In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
ValueError: Lengths must match to compare

In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
ValueError: Lengths must match to compare
Note that this is different from the numpy behavior where a comparison can be broadcast:
In [69]: np.array([1, 2, 3]) == np.array([1])
Out[69]: array([ True, False, False])
or it can return False if broadcasting cannot be done:
In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False
Boolean comparisons of a Series vs None will now be equivalent to comparing with np.nan, rather than raise TypeError. (GH1079).
In [71]: s = pd.Series(range(3))

In [72]: s.iloc[1] = None

In [73]: s
Out[73]:
0    0.0
1    NaN
2    2.0
Length: 3, dtype: float64
Previous behavior:

In [5]: s == None
TypeError: Could not compare <type 'NoneType'> type with Series
New behavior:

In [74]: s == None
Out[74]:
0    False
1    False
2    False
Length: 3, dtype: bool
Usually you simply want to know which values are null.
In [75]: s.isnull()
Out[75]:
0    False
1     True
2    False
Length: 3, dtype: bool
You generally will want to use isnull/notnull for these types of comparisons, as isnull/notnull tells you which elements are null. One has to be mindful that nan's don't compare equal, but None's do. Note that pandas/NumPy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [76]: None == None
Out[76]: True

In [77]: np.nan == np.nan
Out[77]: False
The default behavior for HDFStore write functions with format='table' is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing save the index. The previous behavior can be replicated using the dropna=True option. (GH9382)
In [78]: df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
   ....:                                 'col2': [1, np.nan, np.nan]})
   ....:

In [79]: df_with_missing
Out[79]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

[3 rows x 2 columns]
Previous behavior:

In [27]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
                                format='table', mode='w')

In [28]: pd.read_hdf('file.h5', 'df_with_missing')
Out[28]:
   col1  col2
0     0     1
2     2   NaN
New behavior:

In [80]: df_with_missing.to_hdf('file.h5',
   ....:                        'df_with_missing',
   ....:                        format='table',
   ....:                        mode='w')
   ....:

In [81]: pd.read_hdf('file.h5', 'df_with_missing')
Out[81]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN
The display.precision option has been clarified to refer to decimal places (GH10451).
Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in display.precision.
In [1]: pd.set_option('display.precision', 2)

In [2]: pd.DataFrame({'x': [123.456789]})
Out[2]:
       x
0  123.5
If interpreting precision as “significant figures” this did work for scientific notation but that same interpretation did not work for values with standard formatting. It was also out of step with how numpy handles formatting.
Going forward the value of display.precision will directly control the number of places after the decimal, for regular formatting as well as scientific notation, similar to how numpy’s precision print option works.
In [82]: pd.set_option('display.precision', 2)

In [83]: pd.DataFrame({'x': [123.456789]})
Out[83]:
        x
0  123.46

[1 rows x 1 columns]
To preserve output behavior with prior versions the default value of display.precision has been reduced to 6 from 7.
Categorical.unique now returns new Categoricals with categories and codes that are unique, rather than returning np.array (GH10508)
unordered category: values and categories are sorted by appearance order.
ordered category: values are sorted by appearance order, categories keep existing order.
In [84]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'],
   ....:                      ordered=True)
   ....:

In [85]: cat
Out[85]:
[C, A, B, C]
Categories (3, object): [A < B < C]

In [86]: cat.unique()
Out[86]:
[C, A, B]
Categories (3, object): [A < B < C]

In [87]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'])
   ....:

In [88]: cat
Out[88]:
[C, A, B, C]
Categories (3, object): [A, B, C]

In [89]: cat.unique()
Out[89]:
[C, A, B]
Categories (3, object): [C, A, B]
In earlier versions of pandas, if a bool was passed to the header argument of read_csv, read_excel, or read_html it was implicitly converted to an integer, resulting in header=0 for False and header=1 for True (GH6113)
A bool input to header will now raise a TypeError:
In [29]: df = pd.read_csv('data.csv', header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no header or
header=int or list-like of ints to specify the row(s) making up the column names
Line and kde plot with subplots=True now uses default colors, not all black. Specify color='k' to draw all lines in black (GH9894)
Calling the .value_counts() method on a Series with a categorical dtype now returns a Series with a CategoricalIndex (GH10704)
The metadata properties of subclasses of pandas objects will now be serialized (GH10553).
groupby using Categorical follows the same rule as Categorical.unique described above (GH10508)
Constructing a DataFrame with an array of complex64 dtype previously meant the corresponding column was automatically promoted to the complex128 dtype. pandas will now preserve the itemsize of the input for complex data (GH10952)
Some numeric reduction operators would return ValueError, rather than TypeError, on object types that include strings and numbers (GH11131)
Passing currently unsupported chunksize argument to read_excel or ExcelFile.parse will now raise NotImplementedError (GH8011)
Allow an ExcelFile object to be passed into read_excel (GH11198)
DatetimeIndex.union does not infer freq if self and the input have None as freq (GH11086)
NaT’s methods now either raise ValueError, or return np.nan or NaT (GH9513)
Behavior                       Methods
return np.nan                  weekday, isoweekday
return NaT                     date, now, replace, to_datetime, today
return np.datetime64('NaT')    to_datetime64 (unchanged)
raise ValueError               All other public methods (names not beginning with underscores)
For Series the following indexing functions are deprecated (GH10177).
Deprecated Function    Replacement
.irow(i)               .iloc[i] or .iat[i]
.iget(i)               .iloc[i] or .iat[i]
.iget_value(i)         .iloc[i] or .iat[i]
For DataFrame the following indexing functions are deprecated (GH10177).
Deprecated Function    Replacement
.iget_value(i, j)      .iloc[i, j] or .iat[i, j]
.icol(j)               .iloc[:, j]
These indexing functions have been deprecated in the documentation since 0.11.0.
Categorical.name was deprecated to make Categorical more numpy.ndarray like. Use Series(cat, name="whatever") instead (GH10482).
Setting missing values (NaN) in a Categorical’s categories will issue a warning (GH10748). You can still have missing values in the values.
drop_duplicates and duplicated’s take_last keyword was deprecated in favor of keep. (GH6511, GH8505)
Series.nsmallest and nlargest’s take_last keyword was deprecated in favor of keep. (GH10792)
DataFrame.combineAdd and DataFrame.combineMult are deprecated. They can easily be replaced by using the add and mul methods: DataFrame.add(other, fill_value=0) and DataFrame.mul(other, fill_value=1.) (GH10735).
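For example, the replacements look like the following (a small sketch):

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [10, 20], 'b': [3, 4]})

df1.add(df2, fill_value=0)   # replaces df1.combineAdd(df2)
df1.mul(df2, fill_value=1.)  # replaces df1.combineMult(df2)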
TimeSeries deprecated in favor of Series (note that this has been an alias since 0.13.0), (GH10890)
SparsePanel deprecated and will be removed in a future version (GH11157).
Series.is_time_series deprecated in favor of Series.index.is_all_dates (GH11135)
Legacy offsets (like 'A@JAN') are deprecated (note that this has been an alias since 0.8.0) (GH10878)
WidePanel deprecated in favor of Panel, LongPanel in favor of DataFrame (note these have been aliases since < 0.11.0), (GH10892)
DataFrame.convert_objects has been deprecated in favor of type-specific functions pd.to_datetime, pd.to_timestamp and pd.to_numeric (new in 0.17.0) (GH11133).
Removal of the na_last parameter from Series.order() and Series.sort(), in favor of na_position. (GH5231)
Removal of percentile_width from .describe(), in favor of percentiles. (GH7088)
Removal of the colSpace parameter from DataFrame.to_string(), in favor of col_space (deprecated circa version 0.8.0).
Removal of automatic time-series broadcasting (GH2304)
In [90]: np.random.seed(1234)

In [91]: df = pd.DataFrame(np.random.randn(5, 2),
   ....:                   columns=list('AB'),
   ....:                   index=pd.date_range('2013-01-01', periods=5))
   ....:

In [92]: df
Out[92]:
                   A         B
2013-01-01  0.471435 -1.190976
2013-01-02  1.432707 -0.312652
2013-01-03 -0.720589  0.887163
2013-01-04  0.859588 -0.636524
2013-01-05  0.015696 -2.242685

[5 rows x 2 columns]
Previously
In [3]: df + df.A
FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index

Out[3]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989
Current
In [93]: df.add(df.A, axis='index')
Out[93]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989

[5 rows x 2 columns]
Remove table keyword in HDFStore.put/append, in favor of using format= (GH4645)
Remove kind in read_excel/ExcelFile as it is unused (GH4712)
Remove the infer_type keyword from pd.read_html as it is unused (GH4770, GH7032)
Remove offset and timeRule keywords from Series.tshift/shift, in favor of freq (GH4853, GH4864)
Remove pd.load/pd.save aliases in favor of pd.to_pickle/pd.read_pickle (GH3787)
Added vbench benchmarks for alternative ExcelWriter engines and reading Excel files (GH7171)
Performance improvements in Categorical.value_counts (GH10804)
Performance improvements in SeriesGroupBy.nunique and SeriesGroupBy.value_counts and SeriesGroupBy.transform (GH10820, GH11077)
Performance improvements in DataFrame.drop_duplicates with integer dtypes (GH10917)
Performance improvements in DataFrame.duplicated with wide frames. (GH10161, GH11180)
4x improvement in timedelta string parsing (GH6755, GH10426)
8x improvement in timedelta64 and datetime64 ops (GH6755)
Significantly improved performance of indexing MultiIndex with slicers (GH10287)
8x improvement in iloc using list-like input (GH10791)
Improved performance of Series.isin for datetimelike/integer Series (GH10287)
20x improvement in concat of Categoricals when categories are identical (GH10587)
Improved performance of to_datetime when specified format string is ISO8601 (GH10178)
2x improvement of Series.value_counts for float dtype (GH10821)
Enable infer_datetime_format in to_datetime when date components do not have 0 padding (GH11142)
Regression from 0.16.1 in constructing DataFrame from nested dictionary (GH11084)
Performance improvements in addition/subtraction operations for DateOffset with Series or DatetimeIndex (GH10744, GH11205)
Bug in incorrect computation of .mean() on timedelta64[ns] because of overflow (GH9442)
Bug in .isin on older numpies (GH11232)
Bug in DataFrame.to_html(index=False) renders unnecessary name row (GH10344)
Bug in DataFrame.to_latex() where the column_format argument could not be passed (GH9402)
Bug in DatetimeIndex when localizing with NaT (GH10477)
Bug in Series.dt ops in preserving meta-data (GH10477)
Bug in preserving NaT when passed in an otherwise invalid to_datetime construction (GH10477)
Bug in DataFrame.apply when function returns categorical series. (GH9573)
Bug in to_datetime with invalid dates and formats supplied (GH10154)
Bug in Index.drop_duplicates dropping name(s) (GH10115)
Bug in Series.quantile dropping name (GH10881)
Bug in pd.Series when setting a value on an empty Series whose index has a frequency. (GH10193)
Bug in pd.Series.interpolate with invalid order keyword values. (GH10633)
Bug in DataFrame.plot raises ValueError when color name is specified by multiple characters (GH10387)
Bug in Index construction with a mixed list of tuples (GH10697)
Bug in DataFrame.reset_index when index contains NaT. (GH10388)
Bug in ExcelReader when worksheet is empty (GH6403)
Bug in BinGrouper.group_info where returned values are not compatible with base class (GH10914)
Bug in clearing the cache on DataFrame.pop and a subsequent inplace op (GH10912)
Bug in indexing with a mixed-integer Index causing an ImportError (GH10610)
Bug in Series.count when index has nulls (GH10946)
Bug in pickling of a non-regular freq DatetimeIndex (GH11002)
Bug causing DataFrame.where to not respect the axis parameter when the frame has a symmetric shape. (GH9736)
Bug in Table.select_column where name is not preserved (GH10392)
Bug in offsets.generate_range where start and end have finer precision than offset (GH9907)
Bug in pd.rolling_* where Series.name would be lost in the output (GH10565)
Bug in stack when index or columns are not unique. (GH10417)
Bug in setting a Panel when an axis has a MultiIndex (GH10360)
Bug in USFederalHolidayCalendar where USMemorialDay and USMartinLutherKingJr were incorrect (GH10278 and GH9760 )
Bug in .sample() where returned object, if set, gives unnecessary SettingWithCopyWarning (GH10738)
Bug in .sample() where weights passed as Series were not aligned along axis before being treated positionally, potentially causing problems if weight indices were not aligned with sampled object. (GH10738)
Regression fixed in (GH9311, GH6620, GH9345), where groupby with a datetime-like was converting to float with certain aggregators (GH10979)
Bug in DataFrame.interpolate with axis=1 and inplace=True (GH10395)
Bug in io.sql.get_schema when specifying multiple columns as primary key (GH10385).
Bug in groupby(sort=False) with datetime-like Categorical raises ValueError (GH10505)
Bug in groupby(axis=1) with filter() throws IndexError (GH11041)
Bug in test_categorical on big-endian builds (GH10425)
Bug in Series.shift and DataFrame.shift not supporting categorical data (GH9416)
Bug in Series.map using categorical Series raises AttributeError (GH10324)
Bug in MultiIndex.get_level_values including Categorical raises AttributeError (GH10460)
Bug in pd.get_dummies with sparse=True not returning SparseDataFrame (GH10531)
Bug in Index subtypes (such as PeriodIndex) not returning their own type for .drop and .insert methods (GH10620)
Bug in algos.outer_join_indexer when right array is empty (GH10618)
Bug in filter (regression from 0.16.0) and transform when grouping on multiple keys, one of which is datetime-like (GH10114)
Bug in to_datetime and to_timedelta causing Index name to be lost (GH10875)
Bug in len(DataFrame.groupby) causing IndexError when there’s a column containing only NaNs (GH11016)
Bug that caused segfault when resampling an empty Series (GH10228)
Bug where DatetimeIndex and PeriodIndex.value_counts reset name from the result, but retained it in the result's Index. (GH10150)
Bug in pd.eval using numexpr engine coerces 1 element numpy array to scalar (GH10546)
Bug in pd.concat with axis=0 when column is of dtype category (GH10177)
Bug in read_msgpack where input type is not always checked (GH10369, GH10630)
Bug in pd.read_csv with kwargs index_col=False, index_col=['a', 'b'] or dtype (GH10413, GH10467, GH10577)
Bug in Series.from_csv with header kwarg not setting the Series.name or the Series.index.name (GH10483)
Bug in groupby.var which caused variance to be inaccurate for small float values (GH10448)
Bug in Series.plot(kind='hist') Y Label not informative (GH10485)
Bug in read_csv when using a converter which generates a uint8 type (GH9266)
Bug causing memory leak in time-series line and area plot (GH9003)
Bug when setting a Panel sliced along the major or minor axes when the right-hand side is a DataFrame (GH11014)
Bug that returns None and does not raise NotImplementedError when operator functions (e.g. .add) of Panel are not implemented (GH7692)
Bug where line and kde plots could not accept multiple colors when subplots=True (GH9894)
Bug in left and right align of Series with MultiIndex may be inverted (GH10665)
Bug in left and right join with a MultiIndex may be inverted (GH10741)
Bug in read_stata when reading a file with a different order set in columns (GH10757)
Bug in Categorical not representing properly when categories contain tz or Period (GH10713)
Bug in Categorical.__iter__ may not return correct datetime and Period (GH10713)
Bug in indexing with a PeriodIndex on an object with a PeriodIndex (GH4125)
Bug in read_csv with engine='c': EOF preceded by a comment, blank line, etc. was not handled correctly (GH10728, GH10548)
Reading “famafrench” data via DataReader resulted in an HTTP 404 error because the website URL changed (GH10591).
Bug in read_msgpack where DataFrame to decode has duplicate column names (GH9618)
Bug in io.common.get_filepath_or_buffer which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (GH10604)
Bug in vectorised setting of timestamp columns with python datetime.date and numpy datetime64 (GH10408, GH10412)
Bug in Index.take may add unnecessary freq attribute (GH10791)
Bug in merge with empty DataFrame may raise IndexError (GH10824)
Bug in to_latex where some documented arguments raised an unexpected keyword argument error (GH10888)
Bug in indexing of large DataFrame where IndexError is uncaught (GH10645 and GH10692)
Bug in read_csv when using the nrows or chunksize parameters if file contains only a header line (GH9535)
Bug in serialization of category types in HDF5 in presence of alternate encodings. (GH10366)
Bug in pd.DataFrame when constructing an empty DataFrame with a string dtype (GH9428)
Bug in pd.DataFrame.diff when DataFrame is not consolidated (GH10907)
Bug in pd.unique for arrays with the datetime64 or timedelta64 dtype that meant an array with object dtype was returned instead of the original dtype (GH9431)
Bug in Timedelta raising error when slicing from 0s (GH10583)
Bug in DatetimeIndex.take and TimedeltaIndex.take may not raise IndexError against invalid index (GH10295)
Bug in Series([np.nan]).astype('M8[ms]'), which now returns Series([pd.NaT]) (GH10747)
Bug in PeriodIndex.order reset freq (GH10295)
Bug in date_range when freq divides end as nanos (GH10885)
Bug in iloc allowing memory outside bounds of a Series to be accessed with negative integers (GH10779)
Bug in read_msgpack where encoding is not respected (GH10581)
Bug preventing access to the first index when using iloc with a list containing the appropriate negative integer (GH10547, GH10779)
Bug in TimedeltaIndex formatter causing error while trying to save DataFrame with TimedeltaIndex using to_csv (GH10833)
Bug in DataFrame.where when handling Series slicing (GH10218, GH9558)
Bug where pd.read_gbq throws ValueError when Bigquery returns zero rows (GH10273)
Bug in to_json which was causing segmentation fault when serializing 0-rank ndarray (GH9576)
Bug in plotting functions may raise IndexError when plotted on GridSpec (GH10819)
Bug in plot result may show unnecessary minor ticklabels (GH10657)
Bug in groupby with incorrect computation for aggregation on DataFrame with NaT (e.g. first, last, min). (GH10590, GH11010)
Bug when constructing DataFrame where passing a dictionary with only scalar values and specifying columns did not raise an error (GH10856)
Bug in .var() causing roundoff errors for highly similar values (GH10242)
Bug in DataFrame.plot(subplots=True) with duplicated columns outputs incorrect result (GH10962)
Bug in Index arithmetic may result in incorrect class (GH10638)
Bug where date_range results in an empty index if freq is a negative annual, quarterly, or monthly frequency (GH11018)
Bug in DatetimeIndex cannot infer negative freq (GH11018)
Remove use of some deprecated numpy comparison operations, mainly in tests. (GH10569)
Bug in Index dtype possibly not being applied properly (GH11017)
Bug in io.gbq when testing for minimum google api client version (GH10652)
Bug in DataFrame construction from nested dict with timedelta keys (GH11129)
Bug in .fillna which may raise TypeError when data contains datetime dtype (GH7095, GH11153)
Bug in .groupby when number of keys to group by is same as length of index (GH11185)
Bug in convert_objects where converted values might not be returned if input was all null and coerce was specified (GH9589)
Bug in convert_objects where copy keyword was not respected (GH9589)
A total of 112 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Alex Rothberg
Andrea Bedini +
Andrew Rosenfeld
Andy Hayden
Andy Li +
Anthonios Partheniou +
Artemy Kolchinsky
Bernard Willers
Charlie Clark +
Chris +
Chris Whelan
Christoph Gohlke +
Christopher Whelan
Clark Fitzgerald
Clearfield Christopher +
Dan Ringwalt +
Daniel Ni +
Data & Code Expert Experimenting with Code on Data +
David Cottrell
David John Gagne +
David Kelly +
ETF +
Eduardo Schettino +
Egor +
Egor Panfilov +
Evan Wright
Frank Pinter +
Gabriel Araujo +
Garrett-R
Gianluca Rossi +
Guillaume Gay
Guillaume Poulin
Harsh Nisar +
Ian Henriksen +
Ian Hoegen +
Jaidev Deshpande +
Jan Rudolph +
Jan Schulz
Jason Swails +
Jeff Reback
Jonas Buyl +
Joris Van den Bossche
Joris Vankerschaver +
Josh Levy-Kramer +
Julien Danjou
Ka Wo Chen
Karrie Kehoe +
Kelsey Jordahl
Kerby Shedden
Kevin Sheppard
Lars Buitinck
Leif Johnson +
Luis Ortiz +
Mac +
Matt Gambogi +
Matt Savoie +
Matthew Gilbert +
Maximilian Roos +
Michelangelo D’Agostino +
Mortada Mehyar
Nick Eubank
Nipun Batra
Ondřej Čertík
Phillip Cloud
Pratap Vardhan +
Rafal Skolasinski +
Richard Lewis +
Rinoc Johnson +
Rob Levy
Robert Gieseke
Safia Abdalla +
Samuel Denny +
Saumitra Shahapure +
Sebastian Pölsterl +
Sebastian Rubbert +
Sheppard, Kevin +
Sinhrks
Siu Kwan Lam +
Skipper Seabold
Spencer Carrucciu +
Stephan Hoyer
Stephen Hoover +
Stephen Pascoe +
Terry Santegoeds +
Thomas Grainger
Tjerk Santegoeds +
Tom Augspurger
Vincent Davis +
Winterflower +
Yaroslav Halchenko
Yuan Tang (Terry) +
agijsberts
ajcr +
behzad nouri
cel4
chris-b1 +
cyrusmaher +
davidovitch +
ganego +
jreback
juricast +
larvian +
maximilianr +
msund +
rekcahpassyla
robertzk +
scls19fr
seth-p
sinhrks
springcoil +
terrytangyuan +
tzinckgraf +