This is a major release from 0.7.3 and includes extensive work on the time series handling and processing infrastructure as well as a great deal of new functionality throughout the library. It includes over 700 commits from more than 20 distinct authors. Most pandas 0.7.3 and earlier users should not experience any issues upgrading, but due to the migration to the NumPy datetime64 dtype, there may be a number of bugs and incompatibilities lurking. Lingering incompatibilities will be fixed ASAP in a 0.8.1 release if necessary. See the full release notes or issue tracker on GitHub for a complete list.
All objects can now work with non-unique indexes. Data alignment / join operations work according to SQL join semantics (including, if applicable, index duplication in many-to-many joins)
Time series data are now represented using NumPy’s datetime64 dtype; thus, pandas 0.8.0 now requires at least NumPy 1.6. It has been tested and verified to work with the development version (1.7+) of NumPy as well, which includes some significant user-facing API changes. NumPy 1.6 also has a number of bugs having to do with nanosecond-resolution data, so I recommend that you steer clear of NumPy 1.6’s datetime64 API functions (limited as they are) and only interact with this data using the interface that pandas provides.
See the end of the 0.8.0 section for a “porting” guide listing potential issues for users migrating legacy code bases from pandas 0.7 or earlier to 0.8.0.
Bug fixes to the 0.7.x series for legacy NumPy < 1.6 users will be provided as they arise. There will be no further development in 0.7.x beyond bug fixes.
Note
With this release, legacy scikits.timeseries users should be able to port their code to use pandas.
See documentation for overview of pandas timeseries API.
New datetime64 representation speeds up join operations and data alignment, reduces memory usage, and significantly improves serialization / deserialization performance compared with datetime.datetime
High performance and flexible resample method for converting from high-to-low and low-to-high frequency. Supports interpolation, user-defined aggregation functions, and control over how the intervals and result labeling are defined. A suite of high performance Cython/C-based resampling functions (including Open-High-Low-Close) have also been implemented.
Revamp of frequency aliases and support for frequency shortcuts like ‘15min’ or ‘1h30min’
New DatetimeIndex class supports both fixed frequency and irregular time series. Replaces the now-deprecated DateRange class
New PeriodIndex and Period classes for representing time spans and performing calendar logic, including the 12 fiscal quarterly frequencies. This is a partial port of, and a substantial enhancement to, elements of the scikits.timeseries code base. Support for conversion between PeriodIndex and DatetimeIndex
New Timestamp data type subclasses datetime.datetime, providing the same interface while enabling working with nanosecond-resolution data. Also provides easy time zone conversions.
Enhanced support for time zones. Add tz_convert and tz_localize methods to TimeSeries and DataFrame (illustrated in the sketch after this list). All timestamps are stored as UTC; timestamps from DatetimeIndex objects with a time zone set will be localized to local time. Time zone conversions are therefore essentially free. The user needs to know very little about the pytz library now; only time zone names as strings are required. Time zone-aware timestamps are equal if and only if their UTC timestamps match. Operations between time zone-aware time series with different time zones will result in a UTC-indexed time series
Time series string indexing conveniences / shortcuts: slice years, year and month, and index values with strings
Enhanced time series plotting; adaptation of scikits.timeseries matplotlib-based plotting code
New date_range, bdate_range, and period_range factory functions
Robust frequency inference function infer_freq and inferred_freq property of DatetimeIndex, with option to infer frequency on construction of DatetimeIndex
to_datetime function efficiently parses array of strings to DatetimeIndex. DatetimeIndex will parse array or list of strings to datetime64
Optimized support for datetime64-dtype data in Series and DataFrame columns
New NaT (Not-a-Time) type to represent NA in timestamp arrays
Optimize Series.asof for looking up “as of” values for arrays of timestamps
Milli, Micro, Nano date offset objects
Can index time series with datetime.time objects to select all data at particular time of day (TimeSeries.at_time) or between two times (TimeSeries.between_time)
Add tshift method for leading/lagging using the frequency (if any) of the index, as opposed to a naive lead/lag using shift
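As a rough sketch of how several of the time series features above fit together, here is a hypothetical example (variable names and data are invented for illustration; the resample(...).mean() spelling follows later pandas releases, while in 0.8.0 itself downsampling was spelled resample('D', how='mean')):

import numpy as np
import pandas as pd

# Fixed-frequency index from the new date_range factory, using the
# '15min' frequency shortcut
rng = pd.date_range('2000-01-01', periods=4 * 24 * 3, freq='15min')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

# Downsample to daily means
daily = ts.resample('D').mean()

# Localize naive timestamps, then convert between time zones
ts_eastern = ts.tz_localize('UTC').tz_convert('US/Eastern')

# Convert between DatetimeIndex and PeriodIndex and back again
per = daily.index.to_period('D')
stamps = per.to_timestamp()

# Select all observations taken at a particular time of day
morning = ts.at_time('09:00')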
New cut and qcut functions (like R’s cut function) for computing a categorical variable from a continuous variable by binning values either into value-based (cut) or quantile-based (qcut) bins
Rename Factor to Categorical and add a number of usability features
Add limit argument to fillna/reindex
More flexible multiple function application in GroupBy: a list of (name, function) tuples can be passed to get results in a particular order with the given names (illustrated in the sketch after this list)
Add flexible replace method for efficiently substituting values
Enhanced read_csv/read_table for reading time series data and converting multiple columns to dates
Add comments option to parser functions: read_csv, etc.
Add dayfirst option to parser functions for parsing international DD/MM/YYYY dates
Allow the user to specify the CSV reader dialect to control quoting etc.
Handle thousands separators in read_csv to improve integer parsing
Enable unstacking of multiple levels in one shot. Alleviate pivot_table bugs (empty columns being introduced)
Move to klib-based hash tables for indexing; better performance and less memory usage than Python’s dict
Add first, last, min, max, and prod optimized GroupBy functions
New ordered_merge function
Add flexible comparison instance methods eq, ne, lt, gt, etc. to DataFrame, Series
Improve scatter_matrix plotting function and add histogram or kernel density estimates to diagonal
Add ‘kde’ plot option for density plots
Support for converting DataFrame to R data.frame through rpy2
Improved support for complex numbers in Series and DataFrame
Add pct_change method to all data structures
Add max_colwidth configuration option for DataFrame console output
Interpolate Series values using index values
Can select multiple columns from GroupBy
Add update methods to Series/DataFrame for updating values in place
Add any and all methods to DataFrame
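A similarly hypothetical sketch of a few of the other new features above; the data are invented, and the fillna(method=...) spelling matches 0.8-era usage (newer releases prefer s.ffill(limit=1)):

import numpy as np
import pandas as pd

# cut / qcut: bin a continuous variable into a categorical one
ages = pd.Series([6, 12, 25, 47, 68])
pd.cut(ages, bins=[0, 18, 40, 100])   # value-based bins
pd.qcut(ages, 2)                      # quantile-based bins

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'value': [1.0, 2.0, 3.0, 4.0]})

# A list of (name, function) tuples controls the order and names of
# the aggregated columns
df.groupby('key')['value'].agg([('total', np.sum), ('smallest', np.min)])

s = pd.Series([1.0, np.nan, np.nan, 4.0])
s.fillna(method='ffill', limit=1)     # fill at most one consecutive gap
s.replace(4.0, 40.0)                  # substitute values
s.pct_change()                        # percent change between elements

(df['value'] > 2).any()               # new any/all reductions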
import pandas as pd
import matplotlib.pyplot as plt

fx = pd.read_pickle('data/fx_prices')
Series.plot now supports a secondary_y option:
plt.figure()
fx['FR'].plot(style='g')
fx['IT'].plot(style='k--', secondary_y=True)
Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot types. For example, 'kde' is a new option:
In [1]: s = pd.Series(np.concatenate((np.random.randn(1000),
   ...:                               np.random.randn(1000) * 0.5 + 3)))

In [2]: plt.figure()

In [3]: s.hist(density=True, alpha=0.2)

In [4]: s.plot(kind='kde')   # requires scipy for the kernel density estimate
See the plotting page for much more.
Deprecation of the offset, time_rule, and timeRule argument names in time series functions. Warnings will be printed until pandas 0.9 or 1.0.
The major change that may affect you in pandas 0.8.0 is that time series indexes use NumPy’s datetime64 data type instead of dtype=object arrays of Python’s built-in datetime.datetime objects. DateRange has been replaced by DatetimeIndex, but otherwise behaves identically. However, if you have code that converts DateRange or Index objects that used to contain datetime.datetime values to plain NumPy arrays, you may have bugs lurking with code using scalar values, because you are handing control over to NumPy:
In [5]: import datetime

In [6]: rng = pd.date_range('1/1/2000', periods=10)

In [7]: rng[5]
Out[7]: Timestamp('2000-01-06 00:00:00', freq='D')

In [8]: isinstance(rng[5], datetime.datetime)
Out[8]: True

In [9]: rng_asarray = np.asarray(rng)

In [10]: scalar_val = rng_asarray[5]

In [11]: type(scalar_val)
Out[11]: numpy.datetime64
pandas’s Timestamp object is a subclass of datetime.datetime that has nanosecond support (the nanosecond field stores the nanosecond value between 0 and 999). It should substitute directly into any code that used datetime.datetime values before. Thus, I recommend not casting DatetimeIndex to regular NumPy arrays.
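For instance, a small illustrative example (the timestamp value is made up):

import datetime
import pandas as pd

ts = pd.Timestamp('2000-01-06 00:00:00.000000005')
isinstance(ts, datetime.datetime)   # True: a drop-in for datetime code
ts.nanosecond                       # 5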
If you have code that requires an array of datetime.datetime objects, you have a couple of options. First, the astype(object) method of DatetimeIndex produces an array of Timestamp objects:
In [12]: stamp_array = rng.astype(object)

In [13]: stamp_array
Out[13]:
Index([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
       2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00,
       2000-01-07 00:00:00, 2000-01-08 00:00:00, 2000-01-09 00:00:00,
       2000-01-10 00:00:00],
      dtype='object')

In [14]: stamp_array[5]
Out[14]: Timestamp('2000-01-06 00:00:00', freq='D')
To get an array of proper datetime.datetime objects, use the to_pydatetime method:
In [15]: dt_array = rng.to_pydatetime()

In [16]: dt_array
Out[16]:
array([datetime.datetime(2000, 1, 1, 0, 0),
       datetime.datetime(2000, 1, 2, 0, 0),
       datetime.datetime(2000, 1, 3, 0, 0),
       datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 5, 0, 0),
       datetime.datetime(2000, 1, 6, 0, 0),
       datetime.datetime(2000, 1, 7, 0, 0),
       datetime.datetime(2000, 1, 8, 0, 0),
       datetime.datetime(2000, 1, 9, 0, 0),
       datetime.datetime(2000, 1, 10, 0, 0)], dtype=object)

In [17]: dt_array[5]
Out[17]: datetime.datetime(2000, 1, 6, 0, 0)
matplotlib knows how to handle datetime.datetime but not Timestamp objects. While I recommend that you plot time series using TimeSeries.plot, you can either use to_pydatetime or register a converter for the Timestamp type. See matplotlib documentation for more on this.
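For example, a small sketch (with invented data) of plotting via pandas versus handing matplotlib datetime.datetime values directly:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100)
ts = pd.Series(np.random.randn(100).cumsum(), index=rng)

# Preferred: let pandas drive the plot
ts.plot()

# Calling matplotlib directly: pass datetime.datetime objects so that
# matplotlib's own date handling kicks in
plt.figure()
plt.plot(ts.index.to_pydatetime(), ts.values)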
Warning
There are bugs in the user-facing API with the nanosecond datetime64 unit in NumPy 1.6. In particular, the string version of the array shows garbage values, and conversion to dtype=object is similarly broken.
In [18]: rng = pd.date_range('1/1/2000', periods=10)

In [19]: rng
Out[19]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D')

In [20]: np.asarray(rng)
Out[20]:
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000',
       '2000-01-05T00:00:00.000000000', '2000-01-06T00:00:00.000000000',
       '2000-01-07T00:00:00.000000000', '2000-01-08T00:00:00.000000000',
       '2000-01-09T00:00:00.000000000', '2000-01-10T00:00:00.000000000'],
      dtype='datetime64[ns]')

In [21]: converted = np.asarray(rng, dtype=object)

In [22]: converted[5]
Out[22]: Timestamp('2000-01-06 00:00:00', freq='D')
Trust me: don’t panic. If you are using NumPy 1.6 and restrict your interaction with datetime64 values to pandas’s API you will be just fine. There is nothing wrong with the data-type (a 64-bit integer internally); all of the important data processing happens in pandas and is heavily tested. I strongly recommend that you do not work directly with datetime64 arrays in NumPy 1.6 and only use the pandas API.
Support for non-unique indexes: you may have code inside a try: ... except: block that relied on an operation failing when the index was not unique. In many cases such operations will no longer fail (some methods, like append, still check for uniqueness unless disabled). However, all is not lost: you can inspect index.is_unique and raise an exception explicitly if it is False, or branch to different code.
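For instance, a minimal, hypothetical guard that restores the old fail-on-duplicates behaviour (the helper name and data are invented):

import pandas as pd

def require_unique_index(obj):
    # Mirror the pre-0.8 behaviour of failing on duplicate index labels
    if not obj.index.is_unique:
        raise ValueError("index contains duplicate labels")
    return obj

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'b'])
require_unique_index(df)   # raises ValueError because 'b' is duplicated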
A total of 27 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Adam Klein
Chang She
David Zaslavsky +
Eric Chlebek +
Jacques Kvam
Kamil Kisiel
Kelsey Jordahl +
Kieran O’Mahony +
Lorenzo Bolla +
Luca Beltrame
Marc Abramowitz +
Mark Wiebe +
Paddy Mullen +
Peng Yu +
Roy Hyunjin Han +
RuiDC +
Senthil Palanisami +
Skipper Seabold
Stefan van der Walt +
Takafumi Arakaki +
Thomas Kluyver
Vytautas Jancauskas +
Wes McKinney
Wouter Overmeire
Yaroslav Halchenko
thuske +
timmie +