user305883
user305883

Reputation: 1741

select repository url from github_timeline: group pushes and downloads by repos, and filter repos by stargazers

i don't know sql, i tried to document my self but could not achieve the result I want.

I am looking at this query on Bigquery, against Github_timeline:

SELECT repository_url, actor_attributes_login
FROM [githubarchive:github.timeline]
WHERE type='WatchEvent' AND actor_attributes_login IN (
  SELECT actor_attributes_login FROM [githubarchive:github.timeline]
  WHERE type='WatchEvent'
  GROUP BY actor_attributes_login HAVING (count(*) > 1) AND (count (*) < 500)
)
GROUP EACH BY repository_url, actor_attributes_login;

found here: https://github.com/anvaka/ghindex

From my understanding, the query constraint results to repos with 1 < stargazers < 500.

I would like to:fetch data in one shot from multiple values of Type: (for who's interesting, please find a description of types here: https://developer.github.com/v3/activity/events/types/)

I would like to:

I tried to group lines by repository_url, followed by the stargazers with the limit 1

SELECT repository_url, actor_attributes_login, type
FROM [githubarchive:github.timeline]
WHERE (type='PushEvent'OR type='WatchEvent') AND actor_attributes_login IN (
  SELECT repository_url, actor_attributes_login FROM [githubarchive:github.timeline]
  WHERE (type='WatchEvent' or type='PushEvent')
  GROUP BY repository_url, actor_attributes_login HAVING (count(*) > 1) AND (count (*) < 500) 
)
GROUP EACH BY repository_url, actor_attributes_login, type
LIMIT 100;

but got error:

Error: Right query in semi-join must have exactly one field selected.

I also tried to simplify and gather multiple variable from field TYPE, without attempt to group by repository_url; (here I only used AND actor_attributes_login=='author' to restrict number of results, as a test):

SELECT repository_url, actor_attributes_login, type
FROM [githubarchive:github.timeline]
WHERE (type='WatchEvent') AND actor_attributes_login IN (
  SELECT actor_attributes_login FROM [githubarchive:github.timeline]
  WHERE (type='WatchEvent' OR type='PushEvent' OR type='DownloadEvent' OR type='IssueCommentEvent') AND actor_attributes_login=='author'
  GROUP BY actor_attributes_login HAVING (count(*) > 1) AND (count (*) < 500)
)
GROUP EACH BY repository_url, actor_attributes_login, type LIMIT 100;

But:

Query returned zero records.

Could you help in understanding what am I doing wrong, in order to:

Possibly I would like to combine the above queries the with the constraint applied to the number of users involved in a WatchEvent: - fetch all stars gazed to a repo (that is, all the actor_attributes_login in watchEvents), with a constraint to 1

But eventually I could do the last part in post-processing, to reduce complexity. Thank you for help!

Upvotes: 1

Views: 324

Answers (1)

Mosha Pasumansky
Mosha Pasumansky

Reputation: 13994

Perhaps I misunderstood your problem statement, but I think the following SQL will do what you want:

SELECT a.repository_url, a.actor_attributes_login, a.type
FROM [githubarchive:github.timeline] a
JOIN EACH
(SELECT actor_attributes_login FROM [githubarchive:github.timeline]
 WHERE type IN ('WatchEvent', 'PushEvent')
 GROUP BY actor_attributes_login HAVING (count(*) BETWEEN 1 AND 500)
) b
ON a.actor_attributes_login = b.actor_attributes_login
GROUP EACH BY 1,2,3 LIMIT 100;

Upvotes: 2

Related Questions