Reputation: 31
I have a problem with retrieving language information from GitHub Archive Google BigQuery since the structure of the tables changed which was at the beginning of 2015.
When querying github_timeline table I have a field named repository_language. It allows me to get my language statistics. Unfortunately for 2015 the structure has changed and the table doesn't contain any events after 2014. For example the following query doesn't return any data:
select
repository_language, repository_url, created_at
FROM [githubarchive:github.timeline]
where
PARSE_UTC_USEC(created_at) > PARSE_UTC_USEC('2015-01-02 00:00:00')
Events for 2015 are in: githubarchive:month & githubarchive:day tables. None of them have language information tho (or at least repository_language column).
Can anyone help me?
Upvotes: 3
Views: 367
Reputation: 172994
Look at payload field
It is string that, I think, actually holds JSON with all "missing" attributes
You can process this using JSON Functions
Added Query
Try as below:
SELECT
JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
COUNT(1) AS usage
FROM [githubarchive:month.201601]
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC
Upvotes: 2
Reputation: 59175
What Mikhail said + you can use a query like this:
SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') language, COUNT(*) c
FROM [githubarchive:month.201501]
GROUP BY 1
ORDER BY 2 DESC
Upvotes: 1