gitcub
gitcub

Reputation: 13

GitHub BigQuery commits over time query not returning results for certain repos

I'm trying to pull data about GitHub monthly commits over time using the public dataset at Google BigQuery. The following query provided results for Chef and Ansible but returned nothing for Puppet or Salt.

SELECT
  MONTH(committer.date) month,
  YEAR(committer.date) year,
  repo_name,
  COUNT(*) commits,
FROM 
  [bigquery-public-data:github_repos.commits]
WHERE
  repo_name IN ('puppetlabs/puppet',
  'saltstack/salt',
  'ansible/ansible',
  'chef/chef')
GROUP BY
  month,
  year,
  repo_name

I then attempted to revise the query to pull only Salt or Puppet by revising the WHERE clause to:

WHERE
    repo_name = 'puppetlabs/puppet'

(I also repeated this with 'saltstack/salt' as a separate query.) In each case I received the error message:

'Query returned zero records.'

I have tried to troubleshoot by:
1) confirming that I am using the correct repo names
2) confirming that the repos are public and should (in theory) be included in the BigQuery data and
3) I have tied the query results for Ansible and Chef back to the commits on github.com, and the query in those cases gave accurate results.

Does anyone have any ideas about where the issue lies and how I can modify my query to return data for Salt and Puppet?

Upvotes: 1

Views: 505

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

puppetlabs/puppet is not open source, at least as determined by GitHub's License API:

curl -H "Accept: application/vnd.github.drax-preview+json" \
     https://api.github.com/repos/puppetlabs/puppet |grep license -A 6

"license": {
  "key": "other",
  "name": "Other",
  "spdx_id": null,
  "url": null,
  "featured": false
},

Documentation for the API:

puppetlabs/puppet LICENSE:

It looks like an Apache License 2.0 to me, but it won't be included on the GitHub BigQuery repository until the GitHub License API can determine that this is in fact an open source license.

Note that GitHub uses licensee to power their API, and this is how they say their algorithm runs:

If the license file has an explicit copyright notice, and nothing more (e.g., Copyright (c) 2015 Ben Balter), we'll assume the author intends to retain all rights, and thus the project isn't licensed.

If the license is an exact match to a known license. If we strip away whitespace and copyright notice, we might get lucky, and direct string comparison in Ruby is cheap.

If we still can't match the license, we use a fancy math thing called the Sørensen–Dice coefficient, which is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 90% similar to the MIT license, that 10% likely representing the copyright line being properly adapted to the project.

Now, if you are trying to get their commit info over time, you could use the GitHub Archive BigQuery dataset:

SELECT type, COUNT(*) c
FROM [githubarchive:month.201607]
WHERE repo.name = 'puppetlabs/puppet'
AND type='PushEvent'
GROUP BY 1

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

Run below to see all from puppetlabs for example

SELECT repo_name, COUNT(1) commits
FROM [bigquery-public-data:github_repos.commits]
WHERE repo_name LIKE 'puppetlabs/%' 
GROUP BY repo_name
ORDER BY commits DESC

There are quite a number!

Upvotes: 0

Related Questions