Basia F
Basia F

Reputation: 31

GitHub Archive Google Big Query repositories language information for 2015

I have a problem with retrieving language information from GitHub Archive Google BigQuery since the structure of the tables changed which was at the beginning of 2015.

When querying github_timeline table I have a field named repository_language. It allows me to get my language statistics. Unfortunately for 2015 the structure has changed and the table doesn't contain any events after 2014. For example the following query doesn't return any data:

select
repository_language, repository_url, created_at
FROM [githubarchive:github.timeline]
where
PARSE_UTC_USEC(created_at) > PARSE_UTC_USEC('2015-01-02 00:00:00')

Events for 2015 are in: githubarchive:month & githubarchive:day tables. None of them have language information tho (or at least repository_language column).

Can anyone help me?

Upvotes: 3

Views: 367

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

Look at payload field
It is string that, I think, actually holds JSON with all "missing" attributes

You can process this using JSON Functions

enter image description here

Added Query

Try as below:

SELECT 
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
  COUNT(1) AS usage
FROM [githubarchive:month.201601] 
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC

Upvotes: 2

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

What Mikhail said + you can use a query like this:

SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') language, COUNT(*) c
FROM [githubarchive:month.201501]
GROUP BY 1
ORDER BY 2 DESC

Upvotes: 1

Related Questions