class: center, middle

# Introduction to Pandas

Inria, May 18, 2017

Joris Van den Bossche

https://jorisvandenbossche.github.io/talks/2017_INRIA_pandas

https://github.com/jorisvandenbossche/talks/

.affiliations[
]

---

# Why Python for data analysis?

### High level language

### General purpose

### Excellent interactive use

--
count: false

### + a rich ecosystem of tools for scientific computing and data analysis.

---
class: center, middle

# Python's scientific ecosystem

####

## Thanks to Jake VanderPlas for the figure

---
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem1.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem2.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem3.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem4.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem5.svg)
background-size: cover

---
class: center, middle

---

## Pandas: data analysis in Python

For data-intensive work in Python, the [Pandas](http://pandas.pydata.org) library has become essential.

What is `pandas`?

* Pandas can be thought of as *NumPy arrays with labels* for rows and columns, with better support for heterogeneous data types, but it's also much, much more than that.
* Pandas can also be thought of as `R`'s `data.frame` in Python.
* Powerful for working with missing data, working with time series data, reading and writing your data, and reshaping, grouping, and merging your data, ...

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

---

# When do you need pandas?

When working with **tabular or structured data** (like an R dataframe, SQL table, Excel spreadsheet, ...):

- Import data
- Clean up messy data
- Explore data, gain insight into data
- Process and prepare your data for analysis
- Analyse your data (together with scikit-learn, statsmodels, ...)

---

# Key features

* Fast, easy and flexible input/output for a lot of different data formats
* Working with missing data (`.dropna()`, `pd.isnull()`)
* Merging and joining (`concat`, `join`)
* Grouping: `groupby` functionality
* Reshaping (`stack`, `pivot`)
* Powerful time series manipulation (resampling, timezones, ...)
* Easy plotting

---

# Showcase notebook

* Interact with your data
* Import / export for a variety of formats
* Easy plotting
* Set of powerful methods to manipulate data

https://github.com/jorisvandenbossche/talks/blob/master/2017_INRIA_pandas/pandas_introduction.ipynb

(a solved version is available in the same repo)

---

# Further reading

* Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/

* Books

    * "Python for Data Analysis" by Wes McKinney
    * "Python Data Science Handbook" by Jake VanderPlas

---

# Further reading

* Tutorials

    * https://github.com/jorisvandenbossche/pandas-tutorial
    * https://github.com/brandon-rhodes/pycon-pandas-tutorial

* Tom Augspurger's blog

    * https://tomaugspurger.github.io/modern-1.html

---
class: middle

# Thanks for listening!

## These slides:

- [jorisvandenbossche.github.io/talks/2017_INRIA_pandas](http://jorisvandenbossche.github.io/talks/2017_INRIA_pandas)
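
---

# Backup: key features in code

The methods named on the "Key features" slide can be sketched on a tiny table. This is a minimal illustration only: the data, the column names (`city`, `year`, `value`) and the variable names are made up here and do not come from the showcase notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this sketch.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
    "year": [2016, 2017, 2016, 2017],
    "value": [1.0, np.nan, 3.0, 5.0],
})

# Missing data: flag missing values, then drop incomplete rows.
missing = pd.isnull(df["value"])
clean = df.dropna()

# Grouping: mean value per city (missing values are skipped by default).
means = df.groupby("city")["value"].mean()

# Reshaping: one row per city, one column per year.
wide = df.pivot(index="city", columns="year", values="value")

# Merging: stack a second table below the first one.
extra = pd.DataFrame({"city": ["Nice"], "year": [2017], "value": [2.0]})
combined = pd.concat([df, extra], ignore_index=True)
```

Each step returns a new object and leaves `df` unchanged, so the intermediate results can be inspected one by one in an interactive session.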