walker_4
walker_4

Reputation: 433

Why does the number of forks in Github Archive on Big Query not match the UI?

I am trying to get various Github repo metrics in Github Archive through Big Query(doc here). However, when I try to count the number of forks, the number I am getting is very different from the number of forks specified in the Github UI. For instance when I run this sql script:

SELECT repo.url,repo.name , COUNT(*) fork_count, 
FROM [githubarchive:year.2011],
  [githubarchive:year.2012],
  [githubarchive:year.2013],
  [githubarchive:year.2014],
  [githubarchive:year.2015],
  [githubarchive:year.2016],
  [githubarchive:year.2017],
  [githubarchive:year.2018],
  [githubarchive:month.201901]
WHERE type='ForkEvent'
and repo.url like 'https://github.com/python/cpython'
GROUP BY 1,2

I get a result of:

Row repo_url                           repo_name   fork_count    
1   https://github.com/python/cpython   cpython    177

However when I go to the URL 'https://github.com/python/cpython' I see that there are 8,198 forks. What is the reason for this discrepancy?

EDIT:

Felipe pointed out below that there could be multiple URL's for the same repo. Felipe's output

However, even with multiple URLS, the number was not a exact match to the UI and this time was substantially larger then the UI's number. Is there any way to get an exact match?

Upvotes: 2

Views: 311

Answers (1)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

What are you querying for? Notice you'll get different results depending if you go for the repo id, name, or url:

#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
  repo.id = 81598961 
  OR repo.name='python/cpython'
  OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3

enter image description here

If you want to know "when?":

#standardSQL
SELECT repo.name, repo.id, repo.url, COUNT(*) c
  , MIN(DATE(created_at)) since, MAX(DATE(created_at)) until
FROM `githubarchive.month.201*`
WHERE type='ForkEvent'
AND (
  repo.id = 81598961 
  OR repo.name='python/cpython'
  OR repo.url like 'https://github.com/python/cpython'
)
GROUP BY 1,2,3
ORDER BY since

enter image description here

EDIT:

GitHub only lists one fork per user - so if you want to remove duplicates do COUNT(DISTINCT actor.id) which brings it down to ~9k.

Upvotes: 1

Related Questions