Reputation: 5088
I am trying to trace the growth of the top 100 repositories on GitHub. I have the following query:
SELECT MAX(repository_forks) as forks, repository_url
FROM [publicdata:samples.github_timeline]
WHERE (created_at CONTAINS "2012-04-01")
GROUP BY repository_url
ORDER BY forks DESC LIMIT 100
Which gives me the 100 largest repositories by forks on April 1st of 2012. Then I want to trace how the number of forks grows for each month for each repository.
Hence I constructed the following query:
SELECT repository_name, created_at as month,
SUM(repository_forks) as forks
FROM [githubarchive:github.timeline]
WHERE (repository_name = "rubinius")
GROUP BY repository_name, month
ORDER BY month DESC;
This kind of gives me what I want, but not quite. Instead I need the query to:
How can I accomplish this?
Upvotes: 3
Views: 393
Reputation: 16104
The data in [publicdata:samples.github_timeline]
looks like snapshots of every repository at different timestamps. If that is the case, to calculate the change of fork number per repo per month, I don't think you should do SUM(repository_forks)
. Instead you want to get the first snapshot and the last snapshot of every month and do a minus
calculation to get the delta
.
The result is from the below query:
select repository_name, created_at, repository_forks
from [publicdata:samples.github_timeline]
where repository_name='Bukkit'
order by created_at;
However, I don't understand why at 2012-03-11 08:30:21
, the number of repository_forks from Bukkit
is zero. It might be a data error? If it is data error, I will treat them as outliers. Setting some threshold on it might be able to remove those outliers. Note the threshold I set: where repository_forks > 10
in order to skip the bad data.
SELECT top100.repository_name,
substr(created_at, 0, 7) month,
max(repository_forks)-min(repository_forks) monthly_increase,
min(repository_forks) monthly_begin_at,
max(repository_forks) monthly_end_with
FROM [githubarchive:github.timeline] timeline
JOIN
(SELECT repository_name , MAX(repository_forks) as forks
FROM [githubarchive:github.timeline]
WHERE (created_at CONTAINS "2012-04-01")
GROUP BY repository_name
ORDER BY forks DESC LIMIT 100) top100
on timeline.repository_name = top100.repository_name
where repository_forks > 10
GROUP BY top100.repository_name, month
ORDER BY top100.repository_name, month;
And the result looks like:
If I am wrong and the number of repository_forks is already a change, you can go ahead and do the sum over repository_forks as what you did. Then it's actually easier:
SELECT repository_name, substr(created_at,0,7) as month, SUM(repository_forks) as forks
FROM [publicdata:samples.github_timeline] timeline
JOIN
(SELECT repository_url , MAX(repository_forks) as forks
FROM [publicdata:samples.github_timeline]
WHERE (created_at CONTAINS "2012-04-01")
GROUP BY repository_url
ORDER BY forks DESC LIMIT 100) top100
on timeline.repository_url = top100.repository_url
GROUP BY repository_name, month
ORDER BY repository_name, month DESC;
yes. I changed the dataset to point to githubarchive:github.timeline
, then I have data until December, 2012. Corresponding sql
and results are updated. But the data quality is not good, still see a lot of outlier
data points.
Upvotes: 2