brycek
brycek

Reputation: 23

Google BigQuery: How do I get a distinct row for a value in query results

I am trying to use Google BigQuery on the github archive (http://www.githubarchive.org/) data to get the statistics for repositories at the time of their latest event and I am trying to get this for the repositories with the most watchers. I realize this is a lot but I feel like I'm really close to getting it in one query.

This is the query I have now:

SELECT repository_name, repository_owner, repository_organization, repository_size,  repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

The only problem is that I get all events that are from the highest watched repository (twitter bootstrap):

Result:

Row repository_name repository_owner    repository_organization repository_size watchers    forks   repository_language time     
1   bootstrap           twbs                    twbs                   83875      61191     21602   JavaScript          1384991582000000     
2   bootstrap           twbs                    twbs                   83875      61190     21602   JavaScript          1384991337000000     
3   bootstrap           twbs                    twbs                   83875      61190     21603   JavaScript          1384989683000000

...

How can I just get this to return a single result (the most recent, aka Max(time)) for a repository_name?

I've tried:

SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline])
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language
ORDER BY watchers DESC, time DESC
LIMIT 1000

Not sure if that would work or not but it doesn't matter because I get the error message:

Error: Join attribute is not defined: PARSE_UTC_USEC

Any help would be great, thanks.

Upvotes: 1

Views: 1394

Answers (1)

Jordan Tigani
Jordan Tigani

Reputation: 26617

One issue with that query is that if there are two operations that both happen at the same time, your results can get confused. You can get what you want if you just group by the repository name to get the max commit time for each repository, and then join against that to get the other fields you want. E.g:

SELECT
  a.repository_name as name,
  a.repository_owner as owner,
  a.repository_organization as organization,
  a.repository_size as size,
  a.repository_watchers AS watchers,
  a.repository_forks AS forks,
  a.repository_language as language,
  PARSE_UTC_USEC(created_at) AS time  
FROM [githubarchive:github.timeline] a
JOIN EACH
  (
     SELECT MAX(created_at) as max_created, repository_name 
     FROM [githubarchive:github.timeline]
     GROUP EACH BY repository_name
  ) b
  ON 
  b.max_created = a.created_at and
  b.repository_name = a.repository_name
ORDER BY watchers desc
LIMIT 1000  

Upvotes: 4

Related Questions