This is a bug fix release from 0.9.0 and includes several new features and enhancements along with a large number of bug fixes. The new features include by-column sort order for DataFrame and Series, improved NA handling for the rank method, masking functions for DataFrame, and intraday time-series filtering for DataFrame.
Series.sort, DataFrame.sort, and DataFrame.sort_index now accept a per-column ascending argument, so each sort key can have its own sort order (GH928)
In [2]: df = pd.DataFrame(np.random.randint(0, 2, (6, 3)),
   ...:                   columns=['A', 'B', 'C'])

In [3]: df.sort(['A', 'B'], ascending=[1, 0])
Out[3]:
   A  B  C
3  0  1  1
4  0  1  1
2  0  0  1
0  1  0  0
1  1  0  0
5  1  0  0
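The per-column sort order above can be sketched on fixed data so the result is reproducible. This sketch assumes a recent pandas, where `DataFrame.sort` is no longer available and `sort_values` plays the same role; the frame contents are made up for illustration.

```python
import pandas as pd

# Small fixed frame (hypothetical data) so the ordering is easy to check by eye.
df = pd.DataFrame({"A": [1, 0, 1, 0], "B": [0, 1, 1, 0], "C": [5, 6, 7, 8]})

# Sort by A ascending, then by B descending within each value of A.
# Assumption: sort_values stands in for the DataFrame.sort call shown above.
result = df.sort_values(["A", "B"], ascending=[True, False])
print(result)
```

Each entry of `ascending` pairs with the sort key at the same position, so the two lists must have the same length.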
DataFrame.rank now supports additional argument values for the na_option parameter so missing values can be assigned either the largest or the smallest rank (GH1508, GH2159)
In [1]: df = pd.DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])

In [2]: df.loc[2:4] = np.nan

In [3]: df.rank()
Out[3]:
     A    B    C
0  3.0  2.0  1.0
1  1.0  3.0  2.0
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
5  2.0  1.0  3.0

[6 rows x 3 columns]

In [4]: df.rank(na_option='top')
Out[4]:
     A    B    C
0  6.0  5.0  4.0
1  4.0  6.0  5.0
2  2.0  2.0  2.0
3  2.0  2.0  2.0
4  2.0  2.0  2.0
5  5.0  4.0  6.0

[6 rows x 3 columns]

In [5]: df.rank(na_option='bottom')
Out[5]:
     A    B    C
0  3.0  2.0  1.0
1  1.0  3.0  2.0
2  5.0  5.0  5.0
3  5.0  5.0  5.0
4  5.0  5.0  5.0
5  2.0  1.0  3.0

[6 rows x 3 columns]
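A smaller, deterministic sketch of the three `na_option` behaviours, using a single made-up column with one missing value:

```python
import numpy as np
import pandas as pd

# One column with a single NaN (hypothetical data).
df = pd.DataFrame({"A": [10.0, np.nan, 30.0, 20.0]})

default = df.rank()                    # NaN keeps a NaN rank
top = df.rank(na_option="top")         # NaN gets the smallest rank
bottom = df.rank(na_option="bottom")   # NaN gets the largest rank

print(pd.concat({"default": default["A"],
                 "top": top["A"],
                 "bottom": bottom["A"]}, axis=1))
```

With `na_option="top"` the non-missing values are shifted up by the number of missing entries, which is why the ranks of valid rows differ from the default result.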
DataFrame has new where and mask methods to select values according to a given boolean mask (GH2109, GH2151)
DataFrame currently supports slicing via a boolean vector the same length as the DataFrame (inside the []). The returned DataFrame has the same number of columns as the original, but is sliced on its index.
In [6]: df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [7]: df
Out[7]:
          A         B         C
0  0.276232 -1.087401 -0.673690
1  0.113648 -1.478427  0.524988
2  0.404705  0.577046 -1.715002
3 -1.039268 -0.370647 -1.157892
4 -1.344312  0.844885  1.075770

[5 rows x 3 columns]

In [8]: df[df['A'] > 0]
Out[8]:
          A         B         C
0  0.276232 -1.087401 -0.673690
1  0.113648 -1.478427  0.524988
2  0.404705  0.577046 -1.715002

[3 rows x 3 columns]
If a DataFrame is sliced with a DataFrame-based boolean condition (with the same size as the original DataFrame), then a DataFrame of the same size (index and columns) as the original is returned, with elements that do not meet the boolean condition set to NaN. This is accomplished via the new method DataFrame.where. In addition, where takes an optional other argument that supplies the replacement values.
In [9]: df[df>0]
Out[9]:
          A         B         C
0  0.276232       NaN       NaN
1  0.113648       NaN  0.524988
2  0.404705  0.577046       NaN
3       NaN       NaN       NaN
4       NaN  0.844885  1.075770

[5 rows x 3 columns]

In [10]: df.where(df>0)
Out[10]:
          A         B         C
0  0.276232       NaN       NaN
1  0.113648       NaN  0.524988
2  0.404705  0.577046       NaN
3       NaN       NaN       NaN
4       NaN  0.844885  1.075770

[5 rows x 3 columns]

In [11]: df.where(df>0,-df)
Out[11]:
          A         B         C
0  0.276232  1.087401  0.673690
1  0.113648  1.478427  0.524988
2  0.404705  0.577046  1.715002
3  1.039268  0.370647  1.157892
4  1.344312  0.844885  1.075770

[5 rows x 3 columns]
Furthermore, where now aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .ix (but on the contents rather than the axis labels).
In [12]: df2 = df.copy()

In [13]: df2[ df2[1:4] > 0 ] = 3

In [14]: df2
Out[14]:
          A         B         C
0  0.276232 -1.087401 -0.673690
1  3.000000 -1.478427  3.000000
2  3.000000  3.000000 -1.715002
3 -1.039268 -0.370647 -1.157892
4 -1.344312  0.844885  1.075770

[5 rows x 3 columns]
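The aligned setting above can also be sketched with fixed values. The data is made up, and the sketch assumes the boolean-frame assignment still aligns on the index in the pandas version in use: rows missing from the condition are left untouched.

```python
import pandas as pd

# Hypothetical data: four rows, condition built from rows 1..2 only.
df = pd.DataFrame({"A": [1.0, 2.0, -3.0, 4.0],
                   "B": [-1.0, 5.0, 6.0, 7.0]})
df2 = df.copy()

# The condition covers only rows 1 and 2; it is aligned to df2 before setting,
# so positive values in rows 0 and 3 are not modified.
df2[df2[1:3] > 0] = 0.0
print(df2)
```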
DataFrame.mask is the inverse boolean operation of where.
In [15]: df.mask(df<=0)
Out[15]:
          A         B         C
0  0.276232       NaN       NaN
1  0.113648       NaN  0.524988
2  0.404705  0.577046       NaN
3       NaN       NaN       NaN
4       NaN  0.844885  1.075770

[5 rows x 3 columns]
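To make the relationship between where and mask concrete, here is a minimal sketch on made-up data. It checks that masking with the negated condition gives the same frame as where, and that passing `-df` as the replacement yields absolute values.

```python
import pandas as pd

# Hypothetical 2x2 frame with mixed signs.
df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

kept = df.where(df > 0)           # entries failing the condition become NaN
masked = df.mask(df <= 0)         # entries meeting the condition become NaN
absolute = df.where(df > 0, -df)  # failing entries are replaced by -df

print(kept.equals(masked))        # same result from both spellings
print(absolute)
```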
Enable referencing of Excel columns by their column names (GH1936)
In [16]: xl = pd.ExcelFile('data/test.xls')

In [17]: xl.parse('Sheet1', index_col=0, parse_dates=True,
   ....:          parse_cols='A:D')
Added option to disable pandas-style tick locators and formatters using series.plot(x_compat=True) or pandas.plot_params['x_compat'] = True (GH2205)
Existing TimeSeries methods at_time and between_time were added to DataFrame (GH2149)
DataFrame.dot can now accept ndarrays (GH2042)
DataFrame.drop now supports non-unique indexes (GH2101)
Panel.shift now supports negative periods (GH2164)
DataFrame now supports the unary ~ operator (GH2110)
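Several of the shorter items above (at_time and between_time on DataFrame, dot with an ndarray, drop on a non-unique index, and the unary ~ operator) can be sketched together. All frames, labels, and values below are hypothetical and chosen only to keep the results easy to verify.

```python
import numpy as np
import pandas as pd

# at_time / between_time on a DataFrame with an intraday DatetimeIndex.
idx = pd.date_range("2012-11-01 09:00", periods=5, freq="30min")
ts = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)
at_ten = ts.at_time("10:00")                  # rows stamped exactly 10:00
morning = ts.between_time("09:30", "10:30")   # both endpoints included

# DataFrame.dot with a plain NumPy vector: row-wise sums here.
frame = pd.DataFrame([[1, 2], [3, 4]], columns=["x", "y"])
row_sums = frame.dot(np.array([1, 1]))

# DataFrame.drop on an index with repeated labels removes every match.
dup = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "a", "b"])
dropped = dup.drop("a")

# Unary ~ inverts a boolean DataFrame element-wise.
flags = pd.DataFrame({"p": [True, False], "q": [False, False]})
inverted = ~flags

print(at_ten, morning, row_sums, dropped, inverted, sep="\n\n")
```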
Upsampling data with a PeriodIndex will result in a higher frequency TimeSeries that spans the original time window
In [1]: prng = pd.period_range('2012Q1', periods=2, freq='Q')

In [2]: s = pd.Series(np.random.randn(len(prng)), prng)

In [4]: s.resample('M')
Out[4]:
2012-01   -1.471992
2012-02         NaN
2012-03         NaN
2012-04   -0.493593
2012-05         NaN
2012-06         NaN
Freq: M, dtype: float64
Period.end_time now returns the last nanosecond in the time interval (GH2124, GH2125, GH1764)
In [18]: p = pd.Period('2012')

In [19]: p.end_time
Out[19]: Timestamp('2012-12-31 23:59:59.999999999')
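A short check of the end_time behaviour described above, extended to a quarterly period as an additional illustration (the quarterly case is not in the original notes):

```python
import pandas as pd

year = pd.Period("2012")
quarter = pd.Period("2012Q1")

# end_time is the last nanosecond inside the period, not the start of the next one.
year_end = year.end_time
quarter_end = quarter.end_time

print(year_end)
print(quarter_end)
```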
File parsers no longer coerce to float or bool for columns that have custom converters specified (GH2184)
In [20]: import io

In [21]: data = ('A,B,C\n'
   ....:         '00001,001,5\n'
   ....:         '00002,002,6')
   ....:

In [22]: pd.read_csv(io.StringIO(data), converters={'A': lambda x: x.strip()})
Out[22]:
       A  B  C
0  00001  1  5
1  00002  2  6

[2 rows x 3 columns]
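The same behaviour can be verified programmatically. In this sketch the converter is `str.strip`, which is equivalent to the lambda used above; column A keeps its leading zeros as strings, while column B, which has no converter, is still parsed as integers.

```python
import io

import pandas as pd

data = "A,B,C\n00001,001,5\n00002,002,6"

# Only column A has a converter, so only A is exempt from type coercion.
df = pd.read_csv(io.StringIO(data), converters={"A": str.strip})

print(df["A"].tolist())   # strings with leading zeros preserved
print(df["B"].tolist())   # parsed as integers
```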
See the full release notes or issue tracker on GitHub for a complete list.
A total of 11 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Brenda Moon +
Chang She
Jeff Reback +
Justin C Johnson +
K.-Michael Aye
Martin Blais
Tobias Brandt +
Wes McKinney
Wouter Overmeire
timmie
y-p