Reputation: 67
Missing data in GitHub Archive on BigQuery?
Using BigQuery's GitHub Archive tables, I ran a query on pull requests for the typelevel/cats repo. It returns no entries prior to 1/1/2016, even though the actual repo shows activity beginning on 1/28/2015.
Link to github repo showing earlier pull requests
The query is below. I wanted to check whether this is my error or misunderstanding, or whether some repos are only partially available in the BQ tables.
SELECT
  DATE(created_at) AS date, repo.name, COUNT(*) AS num_PR
FROM
  (TABLE_DATE_RANGE([githubarchive:day.],
    TIMESTAMP('2014-09-26'),
    TIMESTAMP('2016-09-26')
  ))
WHERE
  type = 'PullRequestEvent'
  AND JSON_EXTRACT(payload, '$.action') = '\"opened\"'
  AND repo.name IN ('typelevel/cats')
GROUP BY date, repo.name
ORDER BY date DESC
Upvotes: 0
Views: 244
Reputation: 59175
This repo was renamed, though its id stayed the same. Filtering on repo.id instead of repo.name shows both names and the dates each was in use:
SELECT repo.name, MIN(created_at) since, MAX(created_at) until
FROM (TABLE_DATE_RANGE([githubarchive:day.],
TIMESTAMP('2015-01-01'),
TIMESTAMP('2016-10-01')
))
WHERE repo.id = 29986727
GROUP BY 1
ORDER BY 1
repo_name       since                until
non/cats        2015-01-28 20:26:49  2016-01-30 20:30:41
typelevel/cats  2016-01-30 20:32:30  2016-09-30 16:47:03
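Given that, one way to get the full pull-request history is to filter the original query on repo.id rather than repo.name. This is only a sketch in legacy SQL and I have not run it here: it assumes the id 29986727 from the query above, and it swaps in JSON_EXTRACT_SCALAR so the comparison value needs no escaped quotes. It groups by date only, since repo.name differs before and after the rename.

```sql
SELECT
  DATE(created_at) AS date, COUNT(*) AS num_PR
FROM
  (TABLE_DATE_RANGE([githubarchive:day.],
    TIMESTAMP('2014-09-26'),
    TIMESTAMP('2016-09-26')
  ))
WHERE
  type = 'PullRequestEvent'
  -- JSON_EXTRACT_SCALAR returns the bare string, without surrounding quotes
  AND JSON_EXTRACT_SCALAR(payload, '$.action') = 'opened'
  -- id taken from the query above; it covers both non/cats and typelevel/cats
  AND repo.id = 29986727
GROUP BY date
ORDER BY date DESC
```

If you still want the name in the output, add repo.name back to the SELECT and GROUP BY; days around the rename may then appear as two rows.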
Upvotes: 1