Handling data is a recurring task for most scientists. Reading in experimental data, checking its properties, and creating visualisations can become tedious, so making this process more efficient benefits many scientists. Spreadsheet-based software supports it poorly because it offers little automation and repeatability. A high-level scripting language such as Python is well suited to these tasks.
This course trains students to use Python effectively for these tasks. It focuses on data manipulation and cleaning, exploratory analysis, and visualisation using packages such as Pandas, NumPy and Matplotlib.
The course does not cover statistics, data mining, machine learning, or predictive modelling. It aims to give researchers the means to tackle commonly encountered data handling tasks effectively, in order to increase the overall efficiency of their research.
The course was developed for the Flanders’ Training Network for Methodology and Statistics (Flames), but can be taught to others upon request.
This course is intended for researchers who have at least basic programming skills. A basic (scientific) programming course that is part of the regular curriculum should suffice. For those who have experience in another programming language (e.g. MATLAB, R, …), following a Python tutorial prior to the course is advised.
It is intended for researchers who want to enhance their general data manipulation and analysis skills in Python. The course is NOT intended to be a course on statistics or machine learning.
The course is organized as a two-day course with the following program:
Day 1 - The day starts with setting up the programming environment with the required packages using the conda package manager, followed by an introduction to the Jupyter notebook environment. The rest of the day focuses on the essential concepts of the data analysis package Pandas.
Day 2 - More advanced usage of Pandas for different data cleaning and manipulation tasks is taught. Next, data visualisation with Pandas and Matplotlib is explained. The acquired skills are immediately put into practice on real-world data sets. Applications include time series handling, categorical data, merging data,…
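To give a feel for the kind of tasks listed for day 2, here is a minimal sketch that is not taken from the course material. The `measurements` and `stations` tables, their column names, and all values are made up for illustration; only standard Pandas calls are used.

```python
import pandas as pd

# Hypothetical measurements from two stations (made-up values).
measurements = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 00:00",
                "2020-01-01 12:00",
                "2020-01-02 00:00",
                "2020-01-02 12:00",
            ]
        ),
        "station": ["A", "B", "A", "B"],
        "value": [1.0, 3.0, 5.0, 7.0],
    }
)

# Hypothetical station metadata.
stations = pd.DataFrame({"station": ["A", "B"], "region": ["north", "south"]})

# Merging data: attach the region of each station to every measurement.
merged = measurements.merge(stations, on="station", how="left")

# Categorical data: store the repeated region labels as a category.
merged["region"] = merged["region"].astype("category")

# Time series handling: resample the values to daily means.
daily_mean = merged.set_index("timestamp")["value"].resample("D").mean()

print(merged)
print(daily_mean)
```

In a Jupyter notebook with Matplotlib installed, calling `daily_mean.plot()` would then draw the resampled series.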
The course uses Python 3 and some data analysis packages such as Pandas, NumPy and Matplotlib. To install the required libraries, we highly recommend Anaconda or Miniconda (https://www.anaconda.com/download/) or another Python distribution that includes the scientific libraries (this recommendation applies to all platforms: Windows, Linux and Mac).
For detailed instructions to get started on your local machine, see the setup instructions.
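After installing, a quick way to verify the environment is to check that the packages named above can be found. The following is a small sketch rather than part of the official setup instructions; the helper name `missing_packages` is hypothetical, and the package list is simply the three libraries mentioned in this description.

```python
import importlib.util

# Packages named in the course description (import names).
REQUIRED = ("pandas", "numpy", "matplotlib")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If any package is reported as missing, the setup instructions are the place to look first.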
For the course slides, click here.
Found a typo or have a suggestion? See how to contribute.
Authors: Joris Van den Bossche, Stijn Van Hoey