histelheim
histelheim

Reputation: 5088

Tracing the growth of top 100 repositories on GitHub?

I am trying to trace the growth of the top 100 repositories on GitHub. I have the following query:

SELECT MAX(repository_forks) as forks, repository_url 
FROM [publicdata:samples.github_timeline] 
WHERE (created_at CONTAINS "2012-04-01") 
GROUP BY repository_url 
ORDER BY forks DESC LIMIT 100

Which gives me the 100 largest repositories by forks on April 1st of 2012. Then I want to trace how the number of forks grows for each month for each repository.

Hence I constructed the following query:

SELECT repository_name, created_at as month,
SUM(repository_forks) as forks
FROM [githubarchive:github.timeline]
WHERE (repository_name = "rubinius")
GROUP BY repository_name, month
ORDER BY month DESC;

This kind of gives me what I want, but not quite. Instead I need the query to:

How can I accomplish this?

Upvotes: 3

Views: 393

Answers (1)

greeness
greeness

Reputation: 16104

The data in [publicdata:samples.github_timeline] looks like snapshots of every repository at different timestamps. If that is the case, to calculate the change of fork number per repo per month, I don't think you should do SUM(repository_forks). Instead you want to get the first snapshot and the last snapshot of every month and do a minus calculation to get the delta.

The result is from the below query:

select repository_name, created_at, repository_forks  
from [publicdata:samples.github_timeline]
where repository_name='Bukkit'
order by created_at;

enter image description here

However, I don't understand why at 2012-03-11 08:30:21, the number of repository_forks from Bukkit is zero. It might be a data error? If it is data error, I will treat them as outliers. Setting some threshold on it might be able to remove those outliers. Note the threshold I set: where repository_forks > 10 in order to skip the bad data.

SELECT top100.repository_name, 
  substr(created_at, 0, 7) month, 
  max(repository_forks)-min(repository_forks) monthly_increase, 
  min(repository_forks) monthly_begin_at, 
  max(repository_forks) monthly_end_with 
FROM [githubarchive:github.timeline] timeline
JOIN
(SELECT repository_name , MAX(repository_forks) as forks 
FROM [githubarchive:github.timeline] 
WHERE (created_at CONTAINS "2012-04-01") 
GROUP BY repository_name 
ORDER BY forks DESC LIMIT 100) top100
on timeline.repository_name = top100.repository_name
where repository_forks > 10
GROUP BY top100.repository_name, month
ORDER BY top100.repository_name, month;

And the result looks like:

enter image description here

If I am wrong and the number of repository_forks is already a change, you can go ahead and do the sum over repository_forks as what you did. Then it's actually easier:

SELECT repository_name, substr(created_at,0,7) as month, SUM(repository_forks) as forks
FROM [publicdata:samples.github_timeline] timeline
JOIN
(SELECT repository_url , MAX(repository_forks) as forks 
FROM [publicdata:samples.github_timeline] 
WHERE (created_at CONTAINS "2012-04-01") 
GROUP BY repository_url 
ORDER BY forks DESC LIMIT 100) top100
on timeline.repository_url = top100.repository_url
GROUP BY repository_name, month
ORDER BY repository_name, month DESC;

enter image description here

Update:

yes. I changed the dataset to point to githubarchive:github.timeline, then I have data until December, 2012. Corresponding sql and results are updated. But the data quality is not good, still see a lot of outlier data points.

Upvotes: 2

Related Questions