Activity on the pandas GitHub repo during the March 10 documentation sprint

Tue 13 March 2018

Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!

Really proud of everybody participating at the #pandasSprint. It's been an amazing day. Special thanks to all the organizers, who did a fantastic job. And to the pandas core developers @jorisvdbossche, @jreback and @TomAugspurger who do an unbelievable job every day.
— Marc Garcia (@datapythonista) March 10, 2018

I thought it would be nice to make a figure of the activity on GitHub during the sprint. Using https://www.githubarchive.org/ and the BigQuery interface to their data, it was quite easy. The following query counts the hourly number of events on the pandas-dev/pandas repo from 1 March onwards (roughly the last two weeks):

SELECT 
  STRFTIME_UTC_USEC(created_at, "%Y-%m-%d %H") AS timestamp,
  COUNT(*) AS count
FROM (
  TABLE_DATE_RANGE([githubarchive:day.], 
    TIMESTAMP('2018-03-01'), 
    TIMESTAMP('2018-03-13')
  )) 
WHERE repo.name = 'pandas-dev/pandas'
GROUP BY
  timestamp
ORDER BY
  timestamp ASC

The above query looks for all types of events on GitHub, so it's a total of issues or PRs opened or closed, comments, pushes, ... (https://developer.github.com/v3/activity/events/types/).

I downloaded the result of the above query as a CSV file (note: there are also packages available to load the result of the query directly into a pandas DataFrame). So we can now use pandas and matplotlib to make a graph of it.
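As an aside, the direct-loading route mentioned in the note could look roughly like the sketch below. It is not what was done for this post: it assumes the third-party pandas-gbq package is installed and that you have a Google Cloud project with BigQuery access. The function name load_hourly_events and the project ID argument are placeholders for illustration.

```python
import pandas as pd

# Same legacy-SQL query as above, kept as a string.
QUERY = """
SELECT
  STRFTIME_UTC_USEC(created_at, "%Y-%m-%d %H") AS timestamp,
  COUNT(*) AS count
FROM (
  TABLE_DATE_RANGE([githubarchive:day.],
    TIMESTAMP('2018-03-01'),
    TIMESTAMP('2018-03-13')
  ))
WHERE repo.name = 'pandas-dev/pandas'
GROUP BY timestamp
ORDER BY timestamp ASC
"""


def load_hourly_events(project_id):
    """Run QUERY on BigQuery and return a frame indexed by hour.

    Hypothetical helper: project_id must be a Google Cloud project
    you are allowed to bill queries to.
    """
    # Imported lazily so the module loads even without pandas-gbq installed.
    import pandas_gbq

    # The query uses the legacy SQL dialect (TABLE_DATE_RANGE, [..] tables).
    df = pandas_gbq.read_gbq(QUERY, project_id=project_id, dialect="legacy")
    # The query returns hours as strings such as "2018-03-01 00".
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H")
    return df.set_index("timestamp")
```

Calling load_hourly_events("my-project") would then give a DataFrame of the same shape as the one read from the CSV file below, provided the credentials are set up.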

In [1]:

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

In [2]:

events = pd.read_csv("results-20180313-132419.csv", index_col=0, parse_dates=True)

In [3]:

events.head()

Out[3]:

	count
timestamp
2018-03-01 00:00:00	6
2018-03-01 01:00:00	24
2018-03-01 02:00:00	13
2018-03-01 03:00:00	4
2018-03-01 04:00:00	1
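The read_csv call above relies on pandas recognising the timestamp format on its own. If that ever fails (the index then silently stays as plain strings), the format can be given explicitly. A small sketch, assuming the CSV has a "timestamp" column in the "%Y-%m-%d %H" format produced by the query; the inline CSV text is a stand-in for the downloaded file, using the first rows shown above:

```python
import io

import pandas as pd

# Stand-in for the downloaded file (assumed layout: header + hour,count rows).
csv_text = """timestamp,count
2018-03-01 00,6
2018-03-01 01,24
2018-03-01 02,13
"""

raw = pd.read_csv(io.StringIO(csv_text))
# Parse the hour strings with the exact format used in the query.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], format="%Y-%m-%d %H")
events_explicit = raw.set_index("timestamp")
```

The result has a DatetimeIndex, which is what the resampling step further down requires.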

Some of the hours are missing because there were no recorded events, so to make sure we have a regular time series, I am using resample to have an hourly frequency while filling the missing hours with 0:

In [4]:

events = events.resample('H').asfreq().fillna(0)['count']
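To see what this line does, here is the same chain on a tiny hand-made frame with two missing hours. The numbers are made up for illustration, and pd.offsets.Hour() is used in place of the 'H' string, which should be equivalent here:

```python
import pandas as pd

# Hypothetical data: the hours 02:00 and 03:00 are missing.
toy = pd.DataFrame(
    {"count": [6, 24, 1]},
    index=pd.to_datetime(
        ["2018-03-01 00:00", "2018-03-01 01:00", "2018-03-01 04:00"]
    ),
)

# asfreq() inserts the missing hours as NaN, fillna(0) turns them into zeros,
# and ['count'] selects the single column as a Series.
hourly = toy.resample(pd.offsets.Hour()).asfreq().fillna(0)["count"]
```

The result has five hourly entries instead of three. Note that the values become floats, because the intermediate NaN cannot be stored in an integer column.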

Now we can make a plot of this:

In [5]:

fig, ax = plt.subplots(dpi=120)
events.plot(ax=ax)
ax.set(xlabel='', ylabel="Number of hourly events", title="GitHub activity in the pandas repo")
ax.annotate("What happened here?", (pd.Timestamp("2018-03-10"), 150), (pd.Timestamp("2018-03-02"), 200),
            arrowprops=dict(shrink=0.05, width=1, color='k'), fontsize=14)
fig.tight_layout()

So as expected, we clearly see a huge peak in GitHub activity on the day of the sprint compared to the days before :-)
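If you want to put a number on such a peak rather than read it off the plot, one option is to sum the hourly series per day and look up the maximum. A minimal sketch on made-up data (the counts below are invented, not the real pandas numbers); the same two lines at the end would work on the events series:

```python
import pandas as pd

# Hypothetical hourly series covering three days, 5 events per hour...
idx = pd.date_range("2018-03-09", periods=72, freq=pd.offsets.Hour())
toy_events = pd.Series(5, index=idx)
# ...except on 10 March, where we pretend there were 100 events per hour.
toy_events[toy_events.index.normalize() == pd.Timestamp("2018-03-10")] = 100

daily = toy_events.resample("D").sum()  # total events per calendar day
peak_day = daily.idxmax()               # day with the most events
```

Comparing daily.loc[peak_day] with the other days then gives a rough factor for how much busier the peak day was.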

Many thanks to all organizers and contributors of the sprint. A lot of people learned about contributing to open source, and it made a significant impact on the quality of the pandas API documentation!

This post was written in the Jupyter notebook. You can download this notebook.

python pandas documentation contributing
