caarlos0
caarlos0

Reputation: 20633

BigQuery github dataset returns wrong results

So, I'm trying to do some queries using bigquery-public-data:github_repos.files, which was updated on May 25, 2018, 2:07:03 AM, in theory, it contains all files data from github - as it says in the description of the table:

File metadata for all files at HEAD.

Join with [bigquery-public-data:github_repos.contents] on id columns to search text.

So, I have this tool called goreleaser, to use it, users create a file named .goreleaser.yaml. To have an idea of how many repositories are using it, I was using the github search, something like this a search for filename:goreleaser extension:yaml extension:yml path:/, you can see the results on this link.

This shows 1k+ results, and gets results for all these possible names:

The problem is, github shows the 1k result count, but you can only paginate until 1k or so. I wrote some code in Go using the API and etc, you see it here.

Anyway, I tried to do something similar with bigquery, here is my foolish attempt:

SELECT repo_name, path
FROM [bigquery-public-data:github_repos.files]
WHERE REGEXP_MATCH(path, r'\.?goreleaser.ya?ml')

This will include the vendored tools, which is not ok, but that's not the problem. The problem is that even with the vendored tools, it only shows ~500 results, not 1k.

PS: I also tried the simplified version matching path with LIKE and etc, same results.

So, either I'm doing something horribly wrong, this table does not include all data as it says it does or github search is lying to me.

Any advice?

Thanks!

Upvotes: 2

Views: 302

Answers (1)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

Not every project in GitHub is mirrored on BigQuery's repo dataset.

Let's look at all projects that got more than 40 stars in April, vs what we can find mirrored in BigQuery's repos:

SELECT COUNT(name) april_projects_gt_stars, COUNT(repo_name) projects_mirrored
FROM (
  SELECT DISTINCT repo_name, name, c
  FROM `bigquery-public-data.github_repos.files` a
  RIGHT JOIN (
    SELECT repo.name, COUNT(*) c
    FROM `githubarchive.month.201804` 
    WHERE type='WatchEvent'
    GROUP BY 1
    HAVING c>40
  ) b
  ON repo_name=name
)

enter image description here

9522 vs 3995. Why?

  • Only open source projects are mirrored. This according to the open source detected license - if GitHub can't tell what license a project is using, the project can't be mirrored.
  • New projects: The pipeline might miss some new projects. Please report them.

Upvotes: 3

Related Questions