Activity on the pandas GitHub repo during the March 10 documentation sprint
Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!
Really proud of everybody participating at the #pandasSprint. It's been an amazing day. Special thanks to all the organizers, who did a fantastic job. And to the pandas core developers @jorisvdbossche, @jreback and @TomAugspurger who do an unbelievable job every day. pic.twitter.com/iBZ7dXJDoi
— Marc Garcia (@datapythonista) March 10, 2018
I thought it would be nice to make a figure of the activity on GitHub during the sprint. Using https://www.githubarchive.org/ and the BigQuery interface to their data, it was quite easy. The following query (written in BigQuery's legacy SQL dialect) counts the hourly number of events on the pandas-dev/pandas repo for roughly the last two weeks, from March 1 through March 13:
SELECT
  STRFTIME_UTC_USEC(created_at, "%Y-%m-%d %H") AS timestamp,
  COUNT(*) AS count
FROM (
  TABLE_DATE_RANGE([githubarchive:day.],
    TIMESTAMP('2018-03-01'),
    TIMESTAMP('2018-03-13')
  ))
WHERE repo.name = 'pandas-dev/pandas'
GROUP BY
  timestamp
ORDER BY
  timestamp ASC
The above query looks for all types of events on GitHub, so the count is a total of issues and PRs opened or closed, comments, pushes, and so on (see https://developer.github.com/v3/activity/events/types/ for the full list).
I downloaded the result of the above query as a CSV file (note: there are also packages available to load the result of the query directly into a pandas DataFrame). We can now use pandas and matplotlib to make a graph of it.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
events = pd.read_csv("results-20180313-132419.csv", index_col=0, parse_dates=True)
events.head()
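As an alternative to downloading the CSV file, the query result could be loaded straight into a DataFrame. The following is only a sketch, assuming the pandas-gbq package is installed and that you have a Google Cloud project with BigQuery access. The helper name `load_events` and the project id are hypothetical, and this is not how the data for this post was obtained.

```python
import pandas as pd

# The query from this post, in BigQuery legacy SQL.
QUERY = """
SELECT
  STRFTIME_UTC_USEC(created_at, "%Y-%m-%d %H") AS timestamp,
  COUNT(*) AS count
FROM (
  TABLE_DATE_RANGE([githubarchive:day.],
    TIMESTAMP('2018-03-01'),
    TIMESTAMP('2018-03-13')
  ))
WHERE repo.name = 'pandas-dev/pandas'
GROUP BY timestamp
ORDER BY timestamp ASC
"""


def load_events(project_id):
    """Run QUERY on BigQuery and return the counts indexed by hour.

    Hypothetical helper: `project_id` must be a Google Cloud project
    you are allowed to bill queries to.
    """
    import pandas_gbq  # imported lazily so the sketch loads without it

    df = pandas_gbq.read_gbq(QUERY, project_id=project_id, dialect="legacy")
    # The query returns the hour as a string such as "2018-03-10 14".
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H")
    return df.set_index("timestamp").sort_index()


# Example call (needs credentials, so it is left commented out):
# events = load_events("my-gcp-project")  # hypothetical project id
```

The returned DataFrame has the same shape as the one read from the CSV file: a datetime index and a single `count` column.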
Some of the hours are missing because there were no recorded events, so to make sure we have a regular time series, I am using resample to get an hourly frequency while filling the missing hours with 0:
events = events.resample('H').asfreq().fillna(0)['count']
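To see what this step does, here is a minimal sketch on made-up toy data (not the real query result): two recorded hours with a gap in between. It uses the lowercase `"h"` alias, which recent pandas releases prefer over `'H'`.

```python
import pandas as pd

# Hypothetical toy data: events recorded at 00:00 and 03:00, nothing in between.
toy = pd.DataFrame(
    {"count": [5, 2]},
    index=pd.to_datetime(["2018-03-10 00:00", "2018-03-10 03:00"]),
)

# Upsample to an hourly frequency: the newly inserted hours (01:00 and 02:00)
# start out as NaN and are then filled with 0.
hourly = toy.resample("h").asfreq().fillna(0)["count"]
print(hourly)
```

The result has four rows, one per hour from 00:00 to 03:00, with zeros for the two hours that had no events.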
Now we can make a plot of this:
fig, ax = plt.subplots(dpi=120)
events.plot(ax=ax)
ax.set(xlabel='', ylabel="Number of hourly events", title="GitHub activity in the pandas repo")
ax.annotate("What happened here?", (pd.Timestamp("2018-03-10"), 150), (pd.Timestamp("2018-03-02"), 200),
            arrowprops=dict(shrink=0.05, width=1, color='k'), fontsize=14)
fig.tight_layout()
So, as expected, we clearly see a huge peak in GitHub activity on the day of the sprint compared to the days before :-)
Many thanks to all organizers and contributors of the sprint. A lot of people learned about contributing to open source, and it made a significant impact on the quality of the pandas API documentation!
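The peak can also be checked numerically instead of by eye. The sketch below aggregates an hourly series to daily totals and picks the busiest day. It runs on a made-up stand-in series (the numbers are hypothetical, not the real counts); with the real data you would use the `events` series instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hourly `events` series: four days of
# hourly counts, with the sprint day made artificially busier.
index = pd.date_range("2018-03-08", "2018-03-11 23:00", freq="h")
sprint_day = pd.Timestamp("2018-03-10")
values = np.where(index.normalize() == sprint_day, 20, 1)
toy_events = pd.Series(values, index=index)

# Sum the hourly counts per calendar day and find the day with the most events.
daily = toy_events.resample("D").sum()
peak_day = daily.idxmax()
print(daily)
print("Busiest day:", peak_day.date())
```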
This post was written in a Jupyter notebook. You can download this notebook.