Pandas - What's new and what's coming

class: center, middle

# Pandas

## What's new and what's coming

Joris Van den Bossche, PyParis, June 12, 2017

https://github.com/jorisvandenbossche/talks/

.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---
class: center, middle

![:scale 100%](img/pandas_logo.svg)

---

## Pandas: data analysis in Python

Targeted at working with **tabular or structured data** (like R dataframe, SQL table, Excel spreadsheet, ...):

- Import and export your data
- Clean up messy data
- Explore data, gain insight into data
- Process and prepare your data for analysis
- Analyse your data (together with scikit-learn, statsmodels, ...)

Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

---

# Pandas 0.20.x

Released at the beginning of May 2017

Some highlights:

* Deprecation of `ix`
* Deprecation of `Panel`
* Styled output to Excel
* JSON table schema
* IntervalIndex
* new `agg` method
* ...

Full release notes: [http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-20-1-may-5-2017](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-20-1-may-5-2017)

---

# Styled output to Excel

--
count: false

```python
>>> styled = (df.style
...     .applymap(lambda val: 'color: %s' % ('red' if val < 0 else 'black'))
...     .highlight_max())

>>> styled.to_excel('styled.xlsx', engine='openpyxl')
```

.center[
![](img/style-excel.png)
]

---

# JSON Table Schema

### Potential for better, more interactive display of DataFrames

http://specs.frictionlessdata.io/json-table-schema/

```python
pd.options.display.html.table_schema = True
```
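
As a rough sketch of what the schema looks like, the same release also lets you export a DataFrame together with its Table Schema via `to_json(orient='table')`. The column names and values below are made up for illustration only:

```python
import json

import pandas as pd

# Hypothetical example data, not taken from the talk
df = pd.DataFrame({'A': [1, 2], 'B': [1.5, 2.5]})

# orient='table' emits both a schema description and the records
payload = json.loads(df.to_json(orient='table'))

# Map every field (including the index) to its declared Table Schema type
field_types = {field['name']: field['type']
               for field in payload['schema']['fields']}

print(field_types)
print(payload['data'])
```

The `schema` part is what a frontend can use to decide how to render or sort each column.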

---

# JSON Table Schema

.center[
![:scale 80%](img/json_table_schema.gif)
]

https://github.com/gnestor/jupyterlab_table

https://github.com/nteract/nteract/issues/1572

???

Table Schema

A simple format to declare a schema for tabular data. The schema is designed to be expressible in JSON.

---

# New `.agg` DataFrame method

Mimicking the existing `df.groupby(..).agg(...)`

Applying a custom set of aggregation functions at once:

```python
>>> df = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
>>> df
          A         B
0  0.740440 -1.081037
1 -1.938700  0.851898
2  1.027494 -0.649469
3  2.461105 -0.171393
```

--
count: false

```python
>>> df.agg(['min', 'mean', 'max'])
             A         B
min  -1.938700 -1.081037
mean  0.572585 -0.262500
max   2.461105  0.851898
```
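
A small sketch of a related use, assuming you want different aggregations per column: `agg` also accepts a dict mapping column names to lists of functions. The data here is invented for the example:

```python
import pandas as pd

# Hypothetical data with easy-to-check values
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]})

# Different aggregations per column; combinations that were
# not requested show up as NaN in the result
per_column = df.agg({'A': ['min', 'max'], 'B': ['mean']})

print(per_column)
```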

---

# New `IntervalIndex`

New Index type, with underlying `interval tree` implementation

>  *"It allows one to efficiently find all intervals that overlap with any given interval or point"*
> (Wikipedia on Interval trees)

Think: records that have start and stop values.

--
count: false

```python
>>> index = pd.IntervalIndex.from_arrays(left=[0, 1, 2], right=[10, 5, 3],
...                                      closed='left')
>>> index
IntervalIndex([[0, 10), [1, 5), [2, 3)]
              closed='left',
              dtype='interval[int64]')
```

---

# New `IntervalIndex`

```python
>>> s = pd.Series(list('abc'), index)
>>> s
[0, 10)    a
[1, 5)     b
[2, 3)     c
dtype: object
```

Efficient indexing of non-sorted or overlapping intervals:

```python
>>> s.loc[1.5]
[0, 10)    a
[1, 5)     b
dtype: object
```

Output of `pd.cut`, gridded data, genomics, ...
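
To illustrate the `pd.cut` case, here is a minimal sketch with made-up values and bin edges; the categories of the result are an `IntervalIndex`:

```python
import pandas as pd

# Hypothetical values and bin edges
values = [1, 7, 12]
binned = pd.cut(values, bins=[0, 5, 10, 15])

# The bins are stored as an IntervalIndex (right-closed by default)
categories = binned.categories
print(type(categories).__name__, categories.closed)

# Look up which bin contains a given point
position = categories.get_loc(7)
print(position, categories[position])
```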

---

# Deprecation of `Panel`

### 3D Panel has always been underdeveloped

### Focus on tabular data workflows (1D Series, 2D DataFrame)

--
count: false

### Alternatives

* Multi-Index
* `xarray`: N-D labeled arrays and datasets in Python (http://xarray.pydata.org)

http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro-deprecate-panel
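
A minimal sketch of the Multi-Index alternative, assuming the third dimension can be stored as an extra index level. The names `item` and `row` and the data are hypothetical:

```python
import numpy as np
import pandas as pd

# Two "items" of three rows each, stacked into one 2D DataFrame
index = pd.MultiIndex.from_product([['item1', 'item2'], [0, 1, 2]],
                                   names=['item', 'row'])
df = pd.DataFrame(np.arange(12).reshape(6, 2),
                  index=index, columns=['A', 'B'])

# Selecting one item gives back an ordinary 2D DataFrame
item1 = df.xs('item1', level='item')

print(df)
print(item1)
```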

---

# Deprecation of `ix`

```python
>>> s1 = pd.Series(range(4), index=range(1, 5))

>>> s2 = pd.Series(range(4), index=list('abcd'))

>>> s1
1    0
2    1
3    2
4    3
dtype: int64

>>> s2
a    0
b    1
c    2
d    3
dtype: int64
```

---
# Deprecation of `ix`

### Positional or label-based?

```python
>>> s1.ix[1:3]
.
.
.
.

>>> s2.ix[1:3]
.
.
.
.
```

---
count: false

# Deprecation of `ix`

### Positional or label-based?

```python
>>> s1.ix[1:3]
1    0
2    1
3    2
dtype: int64

>>> s2.ix[1:3]
b    1
c    2
dtype: int64

```

--
count: false

### Use `iloc` or `loc` instead !
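
A short sketch of the explicit alternatives, reusing the two Series defined earlier: `loc` is always label-based (end point included), `iloc` is always positional (end point excluded).

```python
import pandas as pd

s1 = pd.Series(range(4), index=range(1, 5))
s2 = pd.Series(range(4), index=list('abcd'))

# Label-based: labels 1 up to and including 3
by_label = s1.loc[1:3]

# Position-based: positions 1 and 2, i.e. labels 2 and 3
by_position = s1.iloc[1:3]

# For s2 the same rows can be requested either way, but explicitly
s2_by_position = s2.iloc[1:3]
s2_by_label = s2.loc['b':'c']

print(by_label)
print(by_position)
```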

---

# Towards pandas 1.0

### Clarification of the public API, deprecations, clean-up of inconsistencies

http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#reorganization-of-the-library-privacy-changes

--
count: false

### -> continuing effort to move towards a **more stable pandas 1.0**

--
count: false

### But: accumulated technical debt, known limitations, ... -> 1.0 is not an end-point

---

???

# Broader ecosystem

---

# Apache Arrow

Standards and tools for in-memory, columnar data layouts

![:scale 49%](img/arrow_copy2.png)
![:scale 49%](img/arrow_shared2.png)

Source: https://arrow.apache.org/

---

# `to_parquet` / `read_parquet`

Apache Parquet: efficient columnar storage format (https://parquet.apache.org/)

Python bindings with `pyarrow` or `fastparquet`

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('example.parquet')
>>> df = table.to_pandas()
```

--
count: false

In the future (https://github.com/pandas-dev/pandas/pull/15838):

```python
>>> import pandas as pd
>>> df = pd.read_parquet('example.parquet')
>>> df.to_parquet('example.parquet')
```

---

# Pandas 2.0

### Planning significant refactoring of the internals

### Goals: make pandas ...

* Faster and use less memory
* Fix long-standing limitations / inconsistencies
* Easier interoperability / extensibility

---

# Pandas 2.0

* Fixing long-standing limitations or inconsistencies (e.g. in missing data).
* Improved performance and utilization of multicore systems.
* Better user control / visibility of memory usage.
* Clearer semantics around non-NumPy data types, and permitting new pandas-only data types to be added.
* Exposing a “libpandas” C/C++ API to other Python library developers.

More information about this can be found in the “pandas 2.0” design documents: https://pandas-dev.github.io/pandas2/index.html
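
One commonly cited example of the missing-data limitation, sketched here with made-up values: a default integer column cannot hold `NaN`, so introducing a missing value silently converts it to float.

```python
import pandas as pd

# A plain integer Series
s = pd.Series([1, 2, 3])

# Reindexing adds a label without data, which becomes NaN
reindexed = s.reindex(range(4))

print(s.dtype)          # an integer dtype
print(reindexed.dtype)  # upcast to a float dtype to hold NaN
```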

---

### Increasing user base

.center[
![:scale 80%](img/pandas_pypi_downloads.png)
]

--
count: false

### Only ca. 6-10 regular contributors

---

# Pandas 1.0, 2.0, ....

### ... are ideas.

A lot of work is needed (code, design, ..). We need YOUR help.

### ... will impact users.

We need YOUR input on that.

???

These deprecations are part of the path to 1.0,
with a view to 2.0:
    what is the vision for that?
Many changes are coming, and we need YOUR input on them.
Pandas: growth over time (get statistics from the website?), downloads,
    but limited resources and few core contributors (6 to 10)

How can you help? In your institution or company, or as an individual?

* Contribute time
  * Give feedback
  * Report issues
  * Contribute code / PRs
* Contribute money
  * Donations through NumFOCUS

Complain -> complain publicly -> turn it into constructive feedback / improvements of the pandas ecosystem (your most unhappy customers are your greatest source of learning) -> we learn from this, get to know what matters to users, and can try to solve those problems

pandas-dev mailing list

Contribute -> my first contribution to pandas was ... a typo correction (screenshot). Don't hesitate, don't be embarrassed, just do it.

---

# How can you help?

--
count: false

## Complain!

---
count: false

# How can you help?

## ~~Complain!~~

## Complain publicly!

---
count: false

# How can you help?

## ~~Complain!~~

## ~~Complain publicly!~~

### Turn this into constructive feedback / improvements of the pandas ecosystem

--
count: false

[https://github.com/pandas-dev/pandas/issues](https://github.com/pandas-dev/pandas/issues)

[https://mail.python.org/pipermail/pandas-dev/](https://mail.python.org/pipermail/pandas-dev/)

---

# How can you help?

## Contribute feedback

## Contribute code

---

## My first contribution to pandas

--
count: false

![:scale 100%](img/first-contribution.png)

--
count: false

### ... and now I am a pandas core dev

---

# How can you help?

## Contribute feedback

## Contribute code

## Employ pandas developers / allow employees to contribute

---

# Employ pandas developers

.center[
.affiliations[
  ![:scale 40%](img/continuum-logo.png)
  ![:scale 50%](img/twosigma-logo.png)
  ![:scale 35%](img/inria-logo.png)
]
]
---

# How can you help?

## Contribute feedback

## Contribute code

## Employ pandas developers / allow employees to contribute

## Donate

---

# Donate

.center[
![:scale 80%](img/NumFocus_LRG.png)
]

https://www.numfocus.org/

http://pandas.pydata.org/donate.html

---

# How can you help?

## Contribute feedback

## Contribute code

## Employ pandas developers / allow employees to contribute

## Donate

---
class: middle

# Thanks for listening!

## Thanks to all contributors!

## These slides:

- https://github.com/jorisvandenbossche/talks/
- [jorisvandenbossche.github.io/talks/2017_PyParis_pandas](
    http://jorisvandenbossche.github.io/talks/2017_PyParis_pandas)

---
# About me

Joris Van den Bossche

- PhD bio-science engineer, air quality research
- pandas core dev
- Currently working at the Paris-Saclay Center for Data Science (Inria)

https://github.com/jorisvandenbossche

[@jorisvdbossche](https://twitter.com/jorisvdbossche)