anjarp

Reputation: 67

Missing data in GitHub Archive on BigQuery?


Using BigQuery's GitHub Archive tables, I ran a query on pull requests for the typelevel/cats repo. It returns no entries prior to 1/1/2016, even though the actual repo shows activity beginning on 1/28/2015.

Link to github repo showing earlier pull requests

The query is below. I wanted to check whether this is my error or misunderstanding, or whether some repos are only partially available in the BQ tables.

SELECT 
  DATE(created_at) AS date, repo.name, count(*) AS num_PR
FROM
  (TABLE_DATE_RANGE([githubarchive:day.],
     TIMESTAMP('2014-09-26'),
     TIMESTAMP('2016-09-26')
  ))
WHERE
  type = 'PullRequestEvent'
  AND JSON_EXTRACT(payload, '$.action') = '\"opened\"'
  AND repo.name IN ('typelevel/cats')
GROUP BY date, repo.name
ORDER BY date DESC

Upvotes: 0

Views: 244

Answers (1)

Felipe Hoffa

Reputation: 59175

This repo was renamed, but its id stayed the same. Filtering on the name typelevel/cats therefore misses the events recorded under the earlier name:

SELECT repo.name, MIN(created_at) since, MAX(created_at) until
FROM (TABLE_DATE_RANGE([githubarchive:day.],
       TIMESTAMP('2015-01-01'),
       TIMESTAMP('2016-10-01')
    ))
WHERE repo.id = 29986727
GROUP BY 1
ORDER BY 1


repo_name       since                   until    
non/cats        2015-01-28 20:26:49     2016-01-30 20:30:41      
typelevel/cats  2016-01-30 20:32:30     2016-09-30 16:47:03      
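
Given that result, one way to count opened pull requests across both names is to filter on the id instead of the name. The following is a minimal sketch in legacy SQL, not a tested query: it reuses the date range and payload filter from the question and assumes that repo.id 29986727 (taken from the query above) identifies this repo for the whole period.

```sql
-- Sketch: same aggregation as in the question, keyed on repo.id
-- so that events logged as non/cats and typelevel/cats are counted together.
-- Assumption: 29986727 is the stable id of this repo, as used in the query above.
SELECT
  DATE(created_at) AS date, COUNT(*) AS num_PR
FROM
  (TABLE_DATE_RANGE([githubarchive:day.],
     TIMESTAMP('2014-09-26'),
     TIMESTAMP('2016-09-26')
  ))
WHERE
  type = 'PullRequestEvent'
  AND JSON_EXTRACT(payload, '$.action') = '\"opened\"'
  AND repo.id = 29986727
GROUP BY date
ORDER BY date DESC
```

repo.name is left out of the SELECT and GROUP BY here so that a day on which both names appear is not split into two rows. Add it back if you want to see which name each event was recorded under.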

Upvotes: 1
