Reputation: 11439
I want to retrieve the latest infos about a repository using Google Big Query on the github archive timeline dataset.
I tried to join on max(created_at) but i get vastly incomplete informations. Here is the query for the rails repo :
SELECT *
FROM [githubarchive:github.timeline] a
JOIN EACH
(
SELECT MAX(created_at) as max_created, repository_url
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_url
) b
ON
b.max_created = a.created_at and
b.repository_url = a.repository_url
WHERE payload_ref_type="repository" AND a.repository_name = 'rails'
I know this dataset doesn't include data before 2011, but it should contain informations about recently active repositories
I don't understand the results of this query :
Is my query false in some way, why don't it return PushEvents ? Is there another trick about the github archive dataset ?
Upvotes: 1
Views: 286
Reputation: 207863
If you want to retrieve the latest version of a row, you need to use Window functions.
You can get the order of the modifications based on the created_at
timestamp column using.
SELECT *
FROM
(SELECT a.*,
row_number() over (partition BY repository_url
ORDER BY created_at DESC) AS seq_num
FROM [githubarchive:github.timeline] a
WHERE payload_ref_type="repository"
AND a.repository_name = 'rails') d
WHERE seq_num=1 LIMIT 10
seq_num=1
means it will grab the first entry from that specific partition, and since we ordered descending that will be the most recent entry.
Keeping payload_ref_type="repository
in the query would return only 1050 repos, but if you remove that part you will have more.
In the same style you could also leverage data from day and monthly events. Those are in separate table and can be used with table wildcard functions.
Upvotes: 2