Reputation: 433
I am trying to get various Github repo metrics in Github Archive through Big Query(doc here). However, when I try to count the number of forks, the number I am getting is very different from the number of forks specified in the Github UI. For instance when I run this sql script:
SELECT repo.url,repo.name , COUNT(*) fork_count,
FROM [githubarchive:year.2011],
[githubarchive:year.2012],
[githubarchive:year.2013],
[githubarchive:year.2014],
[githubarchive:year.2015],
[githubarchive:year.2016],
[githubarchive:year.2017],
[githubarchive:year.2018],
[githubarchive:month.201901]
WHERE type='ForkEvent'
and repo.url like 'https://github.com/python/cpython'
GROUP BY 1,2
I get a result of:
Row repo_url repo_name fork_count
1 https://github.com/python/cpython cpython 177
However when I go to the URL 'https://github.com/python/cpython' I see that there are 8,198 forks. What is the reason for this discrepancy?
EDIT:
Felipe pointed out below that there could be multiple URL's for the same repo.
However, even with multiple URLS, the number was not a exact match to the UI and this time was substantially larger then the UI's number. Is there any way to get an exact match?
Upvotes: 2
Views: 311
Reputation: 59175
What are you querying for? Notice you'll get different results depending if you go for the repo id, name, or url:
#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
repo.id = 81598961
OR repo.name='python/cpython'
OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3
If you want to know "when?":
#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
, MIN(DATE(created_at)) since, MAX(DATE(created_at)) until
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
repo.id = 81598961
OR repo.name='python/cpython'
OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3
ORDER BY since
EDIT:
GitHub only lists one fork per user - so if you want to remove duplicates do COUNT(DISTINCT actor.id) which brings it down to ~9k.
Upvotes: 1