Data wrangling in Python

class: center, middle

# Data wrangling in Python

Flanders’ Training Network for Methodology and Statistics (FLAMES) 
October 16 and 23, 2023

Joris Van den Bossche, Stijn Van Hoey

https://github.com/jorisvandenbossche/FLAMES-python-data-wrangling

---
class: center, middle

# Who are you?

Go to https://hackmd.io/SEWMuyGfSRGbbaTn4AI97A?both

---

### Joris Van den Bossche

<a href="https://twitter.com/jorisvdbossche"><img src="./static/img/icon_twitter.svg" alt="Twitter logo" class="icon"> jorisvdbossche</a>
<a href="https://github.com/jorisvandenbossche"><img src="./static/img/icon_github.svg" alt="Github logo" class="icon"> jorisvandenbossche</a>

* Open source software developer and teacher
* Pandas, GeoPandas, scikit-learn, Apache Arrow

.center[
![:scale 90%](./static/img/work_joris_1.png)]

---

### Stijn Van Hoey

<a href="https://twitter.com/svanhoey"><img src="./static/img/icon_twitter.svg" alt="Twitter logo"> SVanHoey</a>
<a href="https://github.com/stijnvanhoey"><img src="./static/img/icon_github.svg" alt="Github logo"> stijnvanhoey</a>

* Freelance developer and teacher
* Research software engineer at [Fluves](https://www.fluves.com/)

.center[
![:scale 90%](./static/img/work_stijn_1.png)]

---
class: middle, section_background

# Setting up a working environment

---

## Setting up a working environment

For the setup instructions, see the [setup page](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html).

---
class: left, middle

1. Does everyone have conda installed and the environment set up? If not, see [1-install-python-and-the-required-python-packages](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html#1-install-python-and-the-required-python-packages)
2. Make sure to (re)download ALL the course material, even if you already did this before; see [2-getting-the-course-materials](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html#2-first-day-of-the-course-getting-the-course-materials).
3. Next, complete sections 3 and 4 of the [setup](https://jorisvandenbossche.github.io/FLAMES-python-data-wrangling/setup.html).

> If you have successfully completed steps 1, 2 and 3, put up your `green sticky note` on your laptop screen.

- Go to [the questionnaire](https://hackmd.io/SEWMuyGfSRGbbaTn4AI97A?both) and fill it in.
- In Jupyter Lab, start with the notebook `notebooks/00-jupyter_introduction.ipynb`.

> Installation or setup issues? Put up your `red sticky note` on your laptop screen.

---
class: center, middle

When you see something like this...

![:scale 100%](./static/img/startup.png)

...relax, you're ready to start!

---
class: center, middle

![:scale 100%](https://i.ytimg.com/vi/PlaYMh-u-2w/maxresdefault.jpg)

---
class: middle, left

### Time is divided between

- group sessions: we explain new concepts (aka 'theory')
- practice sessions: you work on exercises or case studies

In case of questions, remarks, suggestions, you can always interrupt us and just ask.

During practice sessions, use the `red sticky note` on top of your laptop screen to let us know you have a question.

### Status check

We will regularly ask for a check (ready with the exercise, installation successful, ...). Use the `green sticky note` on top of your laptop screen to say 👍.

### Feel lost?

Just ask either one of us, we are here to help you.

---
class: middle, center

![:scale 80%](./static/img/issuetracker.png)

Report bugs, typos, suggestions... as issues ([New issue](https://github.com/jorisvandenbossche/course-python-data/issues/new))

or see the [contributing guidelines](https://github.com/jorisvandenbossche/course-python-data/blob/main/CONTRIBUTING.md)

---
class: middle, section_background

# Introduction

---
class: center, middle

index | date
:-----:|:----:
2  | 19930000
8  | 1992-930
27  | 20050500
34  | 201405.01
162 | 7/9/2287
1400  | 0.0
2800 | start of the year 2015
3777 | Summer
8733  | 2013-2016
26766  | 26/09/2002 and later 1/1/2016
40788  | Nan
41277  | /
51002  | -999
51007  | -9999
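
Entries like these rarely parse in one go. As a minimal sketch (standard library only; the list of accepted formats and the day-first reading of `7/9/2287` are assumptions made for illustration), one can try a few known formats and mark everything else as missing. In pandas, `pd.to_datetime(..., errors="coerce")` plays a comparable role.

```python
from datetime import datetime

# Hypothetical list of accepted formats; a real dataset needs its own choices.
KNOWN_FORMATS = ("%Y%m%d", "%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return a datetime for the first matching format, or None if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
    return None


# A few of the raw values from the table above
raw = ["19930000", "1992-930", "201405.01", "7/9/2287", "Summer", "Nan", "-999"]
parsed = {value: parse_date(value) for value in raw}

for value, result in parsed.items():
    print(f"{value!r:>12} -> {result}")
```

Only `7/9/2287` matches one of the assumed formats, and even that date is clearly suspicious: parsing without an error does not mean the value is valid.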

---
class: center, middle

![:scale 100%](./static/img/datacleaning1.jpg)

---
class: center, middle

![:scale 100%](./static/img/datacleaning2.jpg)

---
class: middle, section_background

# Working with Python

---

# Conda

### Why use conda?

- Consistent package manager across Windows, Mac and Linux
- Many precompiled packages available
- Fewer problems with installation!

### Why different environments?

- Manage the dependencies of a specific project/paper/group/...
- You can install different versions of Python and other packages side by side on your computer
- Easily share environments with others

---
## Small overview of conda commands

Creating a new environment:

```
conda create -n my_env python=3.9 pandas

# or from environment file
conda env create -f environment.yml
```

Activating an environment:

```
conda activate my_env
```

Install a new package:

```
conda install matplotlib     # if not working, try: pip install ...
```

List all installed packages: `conda list`

List all your environments: `conda info -e`

See the docs: https://docs.conda.io/projects/conda/en/latest/user-guide/index.html
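
After activating an environment, it can help to confirm from within Python which interpreter and packages are actually in use. The snippet below is a small sketch using only the standard library; the helper name `package_version` and the package names checked are illustrative choices, not part of the course material.

```python
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# The interpreter path and prefix should point inside the active conda environment
print("Python executable:", sys.executable)
print("Environment prefix:", sys.prefix)
print("Python version:", sys.version.split()[0])

for name in ("pandas", "matplotlib"):
    print(f"{name}: {package_version(name) or 'not installed'}")
```

If the printed paths do not contain the name of your environment, the environment was probably not activated before starting Python or Jupyter.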

---
class: center, middle

### Keep track of your Python ecosystem with an `environment.yml`

```
conda env export > environment.yml
```

---

# Writing Python code

## IPython console

.center[![:scale 75%](./static/img/ipython.png)]

---

## Integrated Development Environment (IDE)

* [**Spyder**](https://pythonhosted.org/spyder/) is shipped with Anaconda. The familiar environment for MATLAB/RStudio users.
* [**VS Code**](https://code.visualstudio.com/) is also shipped with Anaconda. General purpose editor with powerful plugin ecosystem.
* [**PyCharm**](https://www.jetbrains.com/pycharm/): Popular for web-development and Django applications, powerful when doing 'real' development (packages, libraries, software)
* [Eclipse + **PyDev plugin**](http://www.pydev.org/): If you already work in Eclipse, add the Python environment
* [**Atom**](https://atom.io/), ...

---

## Jupyter Lab/Notebook
(*previously called IPython notebook*)

**Jupyter notebook** provides an **interactive** scripting environment, ideal for exploration, prototyping,...

.center[![:scale 70%](./static/img/notebook.png)]

...the stuff we're dealing with in this course!

---
class: middle, day_background

# Overview of the first day

* Introduction to Jupyter notebook
* Pandas fundamentals
* *Case study:* Bike count

---
# After DAY 1

## How are you feeling?

### https://forms.gle/JX2GdHcweUqAL28G9

---
class: middle, day_background

# Overview of the second day

* Advanced Pandas
* Visualization: Introduction to Matplotlib
* Pandas: reshaping data
* *Case study:* Organism observations
* *Case study:* Bacterial resistance lab study

---
class: middle, section_background

# Pandas

---

## Pandas: data analysis in Python

For data-intensive work in Python the [Pandas](http://pandas.pydata.org) library has become essential.

What is `pandas`?

* Pandas can be thought of as *NumPy arrays with labels* for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
* Pandas can also be thought of as `R`'s `data.frame` or `tidyverse` in Python.
* Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...
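
A minimal sketch of the "arrays with labels" idea, using made-up values (the column names, dates and numbers are purely illustrative): columns of different types live in one table, rows and columns are addressed by label, and missing values are handled by default.

```python
import numpy as np
import pandas as pd

# Made-up example data: one text column, one float column with a gap, one integer column
df = pd.DataFrame(
    {
        "station": ["A", "B", "C"],
        "temperature": [12.5, np.nan, 9.8],
        "n_bikes": [3, 7, 5],
    },
    index=["2023-10-16", "2023-10-17", "2023-10-18"],
)

print(df.dtypes)  # each column keeps its own data type

# Label-based selection: row label first, then column label
bikes_on_17th = df.loc["2023-10-17", "n_bikes"]

# Missing values are skipped by default in aggregations
mean_temp = df["temperature"].mean()
n_missing = df["temperature"].isna().sum()

print(bikes_on_17th, mean_temp, n_missing)
```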

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

---

## When do you need pandas?

When working with **tabular or structured data** (like R dataframe, SQL table, Excel spreadsheet, ...):

- Import data
- Clean up messy data
- Explore data, gain insight into data
- Process and prepare your data for analysis
- Analyse your data (together with scikit-learn, statsmodels, ...)

---

## Attention!

Pandas is great for working with heterogeneous and tabular 1D/2D data, but not all types of data fit in such structures!

* When working with array data (e.g. images, numerical algorithms): just stick with NumPy
* When working with multidimensional labeled data (e.g. climate data): have a look at [xarray](http://xarray.pydata.org/en/stable/)

---
class: middle, section_background

# Tidy data

---
class: center, middle
background-image: url(./static/img/tidy_data_paper.png)

.footnote[Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). doi:10.18637/jss.v059.i10]

---
class: center, middle

| WWTP | Treatment A | Treatment B |
|:------|-------------|-------------|
| Destelbergen | 8.0  | 6.3 |
| Landegem | 7.5  | 5.2 |
| Dendermonde | 8.3  | 6.2 |
| Eeklo | 6.5  | 7.2 |

---
class: center, middle

| WWTP | Treatment | pH |
|:------|:-------------:|:-------------:|
| Destelbergen | A  | 8.0 |
| Landegem | A  | 7.5 |
| Dendermonde | A  | 8.3 |
| Eeklo | A  | 6.5 |
| Destelbergen | B  | 6.3 |
| Landegem | B  | 5.2 |
| Dendermonde | B  | 6.2 |
| Eeklo | B  | 7.2 |
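
Going from the wide layout (one column per treatment) to this long layout can be sketched in pandas with `melt`. The variable names `wide` and `tidy` are illustrative; the values are those of the tables shown here.

```python
import pandas as pd

# Wide layout: one row per WWTP, one column per treatment
wide = pd.DataFrame(
    {
        "WWTP": ["Destelbergen", "Landegem", "Dendermonde", "Eeklo"],
        "Treatment A": [8.0, 7.5, 8.3, 6.5],
        "Treatment B": [6.3, 5.2, 6.2, 7.2],
    }
)

# Long (tidy) layout: one row per observation, one column per variable
tidy = wide.melt(id_vars="WWTP", var_name="Treatment", value_name="pH")

# Keep only the treatment letter ("Treatment A" becomes "A")
tidy["Treatment"] = tidy["Treatment"].str.replace("Treatment ", "", regex=False)

print(tidy)
```

In the long layout, a summary per treatment becomes a single expression such as `tidy.groupby("Treatment")["pH"].mean()`.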

---
class: center, middle

.center[![:scale 100%](./static/img/tidy_data_scheme.png)]

---
class: center, middle

# How are you feeling?

![:scale 100%](http://esq.h-cdn.co/assets/15/51/980x490/landscape-1450137389-john-cleese.JPG)

### https://forms.gle/eaCeaheJXv8vCTcL8

Please fill in the questionnaire!

---
class: center, middle

# Closing notes

---
class: center, middle

# Python's scientific ecosystem

#### Adjusted from figure by Jake VanderPlas

---
class: center, middle, bgheader
background-image: url(./static/img/JakeVdP-ecosystem1.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./static/img/JakeVdP-ecosystem2.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./static/img/JakeVdP-ecosystem3.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./static/img/JakeVdP-ecosystem4.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(./static/img/JakeVdP-ecosystem5.svg)
background-size: cover

---

# A rich ecosystem of packages:

**Machine learning**: scikit-learn, tensorflow, pytorch, keras, chainer, ...

**Performance**: Numba, Cython, Numexpr, Pythran, C/Fortran wrappers, ...

**Visualisation**: Bokeh, Seaborn, Plotnine, Altair, Plotly, Mayavi, HoloViews, datashader, vaex, ...

**Data structures and parallel/distributed computation**: Xarray, Dask, Distributed, Cupy, ...

Specialized packages in many **scientific fields**: astronomy, natural language processing, image processing, geospatial, ...

**Packaging and distribution**: pip/wheels, conda, Anaconda, Canopy, ...

---
class: center, middle

### Reading advice

[Good Enough Practices in Scientific Computing](https://arxiv.org/pdf/1609.00037v1.pdf)

> "*However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is*"

---
class: center, middle

# Thanks

### Joris Van den Bossche

<a href="https://twitter.com/jorisvdbossche"><img src="./static/img/icon_twitter.svg" alt="Twitter logo"> jorisvdbossche</a>
<a href="https://github.com/jorisvandenbossche"><img src="./static/img/icon_github.svg" alt="Github logo"> jorisvandenbossche</a>

### Stijn Van Hoey