Activity on the pandas GitHub repo during the March 10 documentation sprint

Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!

I thought it would be nice to make a figure of the activity on GitHub during the sprint. Using GitHub Archive (https://www.githubarchive.org/) and the BigQuery interface to their data, this was quite easy. The following query counts the number of events per hour on the pandas-dev/pandas repo from March 1 to March 13, 2018:

SELECT 
  STRFTIME_UTC_USEC(created_at, "%Y-%m-%d %H") AS timestamp,
  COUNT(*) AS count
FROM (
  TABLE_DATE_RANGE([githubarchive:day.], 
    TIMESTAMP('2018-03-01'), 
    TIMESTAMP('2018-03-13')
  )) 
WHERE repo.name = 'pandas-dev/pandas'
GROUP BY
  timestamp
ORDER BY
  timestamp ASC

The above query counts all types of events on GitHub, so the total includes issues and PRs opened or closed, comments, pushes, and so on (https://developer.github.com/v3/activity/events/types/).
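
To make the aggregation concrete, here is a minimal sketch of the same filter, group and count steps in pandas. The `raw` table and its `repo_name` column are invented for illustration; they are not the real GitHub Archive schema or data.

```python
import pandas as pd

# Hypothetical events, made up for this sketch (not real GitHub Archive data).
raw = pd.DataFrame({
    "repo_name": [
        "pandas-dev/pandas",
        "pandas-dev/pandas",
        "other/repo",
        "pandas-dev/pandas",
    ],
    "created_at": pd.to_datetime([
        "2018-03-10 09:05:00",
        "2018-03-10 09:40:00",
        "2018-03-10 09:50:00",
        "2018-03-10 11:15:00",
    ]),
})

# Equivalent of: WHERE repo.name = 'pandas-dev/pandas'
pandas_events = raw[raw["repo_name"] == "pandas-dev/pandas"]

# Equivalent of: GROUP BY the formatted hour, COUNT(*), ORDER BY timestamp
hourly = (
    pandas_events
    .groupby(pandas_events["created_at"].dt.strftime("%Y-%m-%d %H"))
    .size()
    .sort_index()
    .rename("count")
)
hourly.index.name = "timestamp"

print(hourly)
```

Note that the 10:00 hour does not appear in the result at all, because grouping only produces rows for hours that have at least one event.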

I downloaded the result of the above query as a csv file (note: there are also packages available that load the result of the query directly into a pandas DataFrame). We can now use pandas and matplotlib to make a graph of it.

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
In [2]:
events = pd.read_csv("results-20180313-132419.csv", index_col=0, parse_dates=True)
In [3]:
events.head()
Out[3]:
                     count
timestamp
2018-03-01 00:00:00      6
2018-03-01 01:00:00     24
2018-03-01 02:00:00     13
2018-03-01 03:00:00      4
2018-03-01 04:00:00      1

Some of the hours are missing because there were no recorded events, so to make sure we have a regular time series, I am using resample to have an hourly frequency while filling the missing hours with 0:

In [4]:
events = events.resample('H').asfreq().fillna(0)['count']
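
To see what this step does, here is a small sketch on a made-up frame with two missing hours. The numbers are invented, and the sketch uses the lowercase `'h'` alias because pandas 2.2 and later deprecate the uppercase `'H'` used in the cell above.

```python
import pandas as pd

# Toy hourly counts with the 01:00 and 02:00 rows missing (made-up numbers).
toy = pd.DataFrame(
    {"count": [6, 4, 1]},
    index=pd.to_datetime([
        "2018-03-01 00:00",
        "2018-03-01 03:00",
        "2018-03-01 04:00",
    ]),
)

# resample + asfreq inserts a row (holding NaN) for every missing hour;
# fillna(0) then turns those NaN values into zero counts.
regular = toy.resample("h").asfreq().fillna(0)["count"]

print(regular)
```

The result has five consecutive hourly rows, with zeros at 01:00 and 02:00. Because the intermediate NaN values force a float dtype, the counts come out as floats. Applied to the real data, the same call gives one value per hour for the whole period.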

Now we can make a plot of this:

In [5]:
fig, ax = plt.subplots(dpi=120)
events.plot(ax=ax)
ax.set(xlabel='', ylabel="Number of hourly events", title="GitHub activity in the pandas repo")
ax.annotate("What happened here?", (pd.Timestamp("2018-03-10"), 150), (pd.Timestamp("2018-03-02"), 200),
            arrowprops=dict(shrink=0.05, width=1, color='k'), fontsize=14)
fig.tight_layout()

So as expected, we clearly see a huge peak in GitHub activity on the day of the sprint compared to the days before :-)

Many thanks to all organizers and contributors of the sprint. A lot of people learned about contributing to open source, and it made a significant impact on the quality of the pandas API documentation!

This post was written in the Jupyter notebook. You can download this notebook.
