Reputation: 23
I am trying to use Google BigQuery on the github archive (http://www.githubarchive.org/) data to get the statistics for repositories at the time of their latest event and I am trying to get this for the repositories with the most watchers. I realize this is a lot but I feel like I'm really close to getting it in one query.
This is the query I have now:
SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000
The only problem is that I get all events that are from the highest watched repository (twitter bootstrap):
Result:
Row repository_name repository_owner repository_organization repository_size watchers forks repository_language time
1 bootstrap twbs twbs 83875 61191 21602 JavaScript 1384991582000000
2 bootstrap twbs twbs 83875 61190 21602 JavaScript 1384991337000000
3 bootstrap twbs twbs 83875 61190 21603 JavaScript 1384989683000000
...
How can I just get this to return a single result (the most recent, aka Max(time)) for a repository_name?
I've tried:
SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline])
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000
Not sure if that would work or not but it doesn't matter because I get the error message:
Error: Join attribute is not defined: PARSE_UTC_USEC
Any help would be great, thanks.
Upvotes: 1
Views: 1394
Reputation: 26617
One issue with that query is that if there are two operations that both happen at the same time, your results can get confused. You can get what you want if you just group by the repository name to get the max commit time for each repository, and then join against that to get the other fields you want. E.g:
SELECT
a.repository_name as name,
a.repository_owner as owner,
a.repository_organization as organization,
a.repository_size as size,
a.repository_watchers AS watchers,
a.repository_forks AS forks,
a.repository_language as language,
PARSE_UTC_USEC(created_at) AS time
FROM [githubarchive:github.timeline] a
JOIN EACH
(
SELECT MAX(created_at) as max_created, repository_name
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name
) b
ON
b.max_created = a.created_at and
b.repository_name = a.repository_name
ORDER BY watchers desc
LIMIT 1000
Upvotes: 4