Reputation: 20633
So, I'm trying to do some queries using bigquery-public-data:github_repos.files
, which was updated on May 25, 2018, 2:07:03 AM
, in theory, it contains all files data from github - as it says in the description of the table:
File metadata for all files at HEAD.
Join with [bigquery-public-data:github_repos.contents] on id columns to search text.
So, I have this tool called goreleaser, to use it, users create a file named .goreleaser.yaml. To have an idea of how many repositories are using it, I was using the github search, something like this a search for filename:goreleaser extension:yaml extension:yml path:/
, you can see the results on this link.
This shows 1k+ results, and gets results for all these possible names:
goreleaser.yml
goreleaser.yaml
.goreleaser.yml
.goreleaser.yaml
The problem is, github shows the 1k result count, but you can only paginate until 1k or so. I wrote some code in Go using the API and etc, you see it here.
Anyway, I tried to do something similar with bigquery, here is my foolish attempt:
SELECT repo_name, path
FROM [bigquery-public-data:github_repos.files]
WHERE REGEXP_MATCH(path, r'\.?goreleaser.ya?ml')
This will include the vendored tools, which is not ok, but that's not the problem. The problem is that even with the vendored tools, it only shows ~500 results, not 1k.
PS: I also tried the simplified version matching path
with LIKE
and etc, same results.
So, either I'm doing something horribly wrong, this table does not include all data as it says it does or github search is lying to me.
Any advice?
Thanks!
Upvotes: 2
Views: 302
Reputation: 59175
Not every project in GitHub is mirrored on BigQuery's repo dataset.
Let's look at all projects that got more than 40 stars in April, vs what we can find mirrored in BigQuery's repos:
SELECT COUNT(name) april_projects_gt_stars, COUNT(repo_name) projects_mirrored
FROM (
SELECT DISTINCT repo_name, name, c
FROM `bigquery-public-data.github_repos.files` a
RIGHT JOIN (
SELECT repo.name, COUNT(*) c
FROM `githubarchive.month.201804`
WHERE type='WatchEvent'
GROUP BY 1
HAVING c>40
) b
ON repo_name=name
)
9522 vs 3995. Why?
Upvotes: 3