class: center, middle

# Data wrangling in Python

Flanders’ Training Network for Methodology and Statistics - FLAMES

18-19 October 2021

Joris Van den Bossche, Stijn Van Hoey

https://github.com/jorisvandenbossche/FLAMES-python-data-wrangling

---
class: center, middle

# Who are you?

Go to https://hackmd.io/fvTt3rBkRUKV_gAtTqesFQ?both

---

### Joris Van den Bossche

jorisvdbossche, jorisvandenbossche

* Open source software developer and teacher
* Pandas, GeoPandas, scikit-learn, Apache Arrow

---

### Stijn Van Hoey

SVanHoey, stijnvanhoey

* Research software engineer

---
class: middle, section_background

# Setting up a working environment

---

## Setting up a working environment

For the setup instructions, see the [setup page](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html).

---
class: left, middle

0. Everyone has conda installed and the environment set up? If not, see [1-install-python-and-the-required-python-packages](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html#1-install-python-and-the-required-python-packages)
1. Make sure to (re)download ALL the course material, see [2-getting-the-course-materials](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html#2-getting-the-course-materials), even if you already did this before.
2. Next, also do sections 3 and 4 of the [setup](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html)

> If you have successfully completed steps 0, 1 and 2, put up your `green sticky note` on your laptop screen.

Next:

- Open and fill in [the questionnaire](https://hackmd.io/fvTt3rBkRUKV_gAtTqesFQ?both)
- In Jupyter Lab, start with 'notebooks/00-jupyter_introduction.ipynb'.

> Installation or setup issues? Put up your `orange/red/yellow sticky note` on your laptop screen.

---
class: center, middle

When you see something like this...

...relax, you're ready to start!

---
class: middle, left

### Time is divided between

- group sessions: we explain new concepts (aka 'theory')
- practice sessions: you work on exercises

In case of questions, remarks or suggestions, you can always interrupt us and just ask.

During practice sessions, use the `orange/red/yellow sticky note` on top of your laptop screen to let us know you have a question.

### Status check

We will regularly ask for a check (ready with exercise, installation successful...). Use the `green sticky note` on top of your laptop screen to say 👍.

### Feel lost?

Just ask either one of us; we are here to help you.

---
class: middle, center

Report bugs, typos, suggestions... as issues ([New issue](https://github.com/jorisvandenbossche/FLAMES-python-data-wrangling/issues/new)) or see the [contributing guidelines](https://github.com/jorisvandenbossche/FLAMES-python-data-wrangling/blob/master/CONTRIBUTING.md)

---
class: middle, section_background

# Introduction

---
class: center, middle

index | date
:-----:|:----:
2 | 19930000
8 | 1992-930
27 | 20050500
34 | 201405.01
162 | 7/9/2287
1400 | 0.0
2800 | start of the year 2015
3777 | Summer
8733 | 2013-2016
26766 | 26/09/2002 and later 1/1/2016
40788 | Nan
41277 | /
51002 | -999
51007 | -9999

--

.center[Never underestimate the creativity of humans!]

---
class: middle, section_background

# Working with Python

---

# Conda

### Why use conda?

- Consistent package manager across Windows, Mac and Linux
- Many precompiled packages available
- Fewer problems with installation!

--

### Why different environments?

- Manage the dependencies of a specific project/paper/group/...
- You can install different versions of Python and other packages side by side on your computer
- Easily share environments with others

---

## Small overview of conda commands

Creating a new environment:

```
conda create -n my_env python=3.7 pandas
# or from environment file
conda env create -f environment.yml
```

Activating an environment:

```
conda activate my_env
```

Install a new package:

```
conda install matplotlib
# if not working, try: pip install ...
```

List all installed packages: `conda list`

List all your environments: `conda info -e`

See the docs: http://conda.pydata.org/docs/using/index.html

---
class: center, middle

### Keep track of your Python ecosystem
with an `environment.yml`

```
conda env export > environment.yml
```

---

# Writing Python code

## IPython console

---

## Integrated Development Environment (IDE)

* [**Spyder**](https://pythonhosted.org/spyder/) is shipped with Anaconda. The familiar environment for Matlab/RStudio users...
* [**PyCharm**](https://www.jetbrains.com/pycharm/): Popular for web development and Django applications, powerful when doing 'real' development (packages, libraries, software)
* [Eclipse + **PyDev plugin**](http://www.pydev.org/): If you like working in Eclipse, just add the Python environment
* [**VS Code**](https://code.visualstudio.com/), [**Atom**](https://atom.io/), ...

---

## Jupyter Lab/Notebook

(*previously called IPython notebook*)
**Jupyter notebook** provides an **interactive** scripting environment,
ideal for exploration, prototyping,...

--

...the stuff we're dealing with in this course!

---
class: middle, day_background

# Overview of the first day

* Introduction to Jupyter notebook
* Pandas fundamentals
* *Case study:* Bike count

---
class: middle, section_background

# Pandas

---

## Pandas: data analysis in Python

For data-intensive work in Python the [Pandas](http://pandas.pydata.org) library has become essential.

What is `pandas`?

* Pandas can be thought of as *NumPy arrays with labels* for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
* Pandas can also be thought of as `R`'s `data.frame` or `tidyverse` in Python.
* Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

---

## When do you need pandas?

When working with **tabular or structured data** (like an R dataframe, SQL table, Excel spreadsheet, ...):

- Import data
- Clean up messy data
- Explore data, gain insight into data
- Process and prepare your data for analysis
- Analyse your data (together with scikit-learn, statsmodels, ...)

---

## Attention!

Pandas is great for working with heterogeneous and tabular 1D/2D data, but not all types of data fit in such structures!

* When working with array data (e.g. images, numerical algorithms): just stick with NumPy
* When working with multidimensional labeled data (e.g. climate data): have a look at [xarray](http://xarray.pydata.org/en/stable/)

---

## Key features

* Fast, easy and flexible input/output for a lot of different data formats
* Working with missing data (`.dropna()`, `pd.isnull()`)
* Merging and joining (`concat`, `join`)
* Grouping: `groupby` functionality
* Reshaping (`stack`, `pivot`)
* Powerful time series manipulation (resampling, timezones, ...)
* Easy plotting

---

# Let's get to it!

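
Before opening the notebooks, here is a quick taste of three of the key features listed earlier: missing data handling, `groupby` and time series resampling. This is only a minimal sketch: the bike counts, the column names (`timestamp`, `direction`, `count`) and the dates are made up for illustration and are not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly bike counts; values and column names are invented
df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2021-10-18 06:00", periods=6, freq="h"),
        "direction": ["north", "south", "north", "south", "north", "south"],
        "count": [12, 15, np.nan, 30, 41, 38],
    }
)

# Missing data: drop the rows that have no count
clean = df.dropna(subset=["count"])

# Grouping: total count per direction
per_direction = clean.groupby("direction")["count"].sum()

# Time series: use the timestamps as index and sum per 3-hour bin
per_3h = clean.set_index("timestamp")["count"].resample("3h").sum()

print(per_direction)
print(per_3h)
```

Each step returns a new object instead of modifying `df`, so the original table stays available for comparison while you explore.
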
### Pandas fundamentals...

1. Data structures `pandas_01_data_structures.ipynb`
2. Basic Operations `pandas_02_basic_operations.ipynb`
3. Indexing and selecting data `pandas_03a_selecting_data.ipynb`
4. Time series data `pandas_04_time_series_data.ipynb`
5. (Combining data `pandas_05_combining_datasets.ipynb`)

---

# After DAY 1

## How are you feeling?

### https://forms.gle/JX2GdHcweUqAL28G9

---
class: middle, day_background

# Overview of the second day

* Advanced Pandas
* Visualization: Introduction to Matplotlib
* Pandas: reshaping data
* *Case study:* Organism observations
* *Case study:* Bacterial resistance lab study

---
class: center, middle

# How are you feeling?

### https://forms.gle/JX2GdHcweUqAL28G9

Please fill in the questionnaire!

---
class: center, middle

# Closing notes

---
class: center, middle

# Python's scientific ecosystem

#### Adjusted from figure by Jake VanderPlas

---
class: center, middle, bgheader
background-image: url(./img/JakeVdP-ecosystem1.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./img/JakeVdP-ecosystem2.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./img/JakeVdP-ecosystem3.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./img/JakeVdP-ecosystem4.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./img/JakeVdP-ecosystem5.svg)
background-size: cover

---

# A rich ecosystem of packages:

**Machine learning**: scikit-learn, tensorflow, pytorch, keras, chainer, ...

**Performance**: Numba, Cython, Numexpr, Pythran, C/Fortran wrappers, ...

**Visualisation**: Bokeh, Seaborn, Plotnine, Altair, Plotly, Mayavi, HoloViews, datashader, vaex ...

**Data structures and parallel/distributed computation**: Xarray, Dask, Distributed, Cupy, ...

Specialized packages in many **scientific fields**: astronomy, natural language processing, image processing, geospatial, ...

**Packaging and distribution**: pip/wheels, conda, Anaconda, Canopy, ...

---
class: center, middle

### Reading advice

[Good Enough Practices in Scientific Computing](https://arxiv.org/pdf/1609.00037v1.pdf)

> "*However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is*"

---
class: center, middle

## Good luck!