vdaubry
vdaubry

Reputation: 11439

Getting latest repository infos from Github Archive

I want to retrieve the latest infos about a repository using Google Big Query on the github archive timeline dataset.

I tried to join on max(created_at) but i get vastly incomplete informations. Here is the query for the rails repo :

SELECT *
FROM [githubarchive:github.timeline] a
JOIN EACH
  (
     SELECT MAX(created_at) as max_created, repository_url
     FROM [githubarchive:github.timeline]
     GROUP EACH BY repository_url
  ) b
  ON 
  b.max_created = a.created_at and
  b.repository_url = a.repository_url
WHERE payload_ref_type="repository" AND a.repository_name = 'rails'

I know this dataset doesn't include data before 2011, but it should contain informations about recently active repositories

I don't understand the results of this query :

Is my query false in some way, why don't it return PushEvents ? Is there another trick about the github archive dataset ?

Upvotes: 1

Views: 286

Answers (1)

Pentium10
Pentium10

Reputation: 207863

If you want to retrieve the latest version of a row, you need to use Window functions.

You can get the order of the modifications based on the created_at timestamp column using.

SELECT *
FROM
  (SELECT a.*,
          row_number() over (partition BY repository_url
                             ORDER BY created_at DESC) AS seq_num
   FROM [githubarchive:github.timeline] a
   WHERE payload_ref_type="repository"
     AND a.repository_name = 'rails') d
WHERE seq_num=1 LIMIT 10

seq_num=1 means it will grab the first entry from that specific partition, and since we ordered descending that will be the most recent entry.

Keeping payload_ref_type="repository in the query would return only 1050 repos, but if you remove that part you will have more.

In the same style you could also leverage data from day and monthly events. Those are in separate table and can be used with table wildcard functions.

Upvotes: 2

Related Questions