danvk

Reputation: 16955

How can I get a list of all public GitHub repos with more than 20 stars?

I'd like to get a list of all the public GitHub repos with more than a certain number of stars (say 15 or 20). I can use the GitHub GraphQL API to get a list of repos with more than 15 stars:

query {
  search(query: "is:public stars:>15", type: REPOSITORY, first:10) {
    repositoryCount
    edges {
      node {
        ... on Repository {
          nameWithOwner
          stargazers {
            totalCount
          }
        }
      }
    }
  }
}

The result looks like this:

{
  "data": {
    "search": {
      "repositoryCount": 704279,
      "edges": [
        { "node": { "nameWithOwner": "freeCodeCamp/freeCodeCamp", "stargazers": { "totalCount": 308427 } } },
        { "node": { "nameWithOwner": "996icu/996.ICU", "stargazers": { "totalCount": 249062 } } },
        { "node": { "nameWithOwner": "vuejs/vue", "stargazers": { "totalCount": 156364 } } },
        { "node": { "nameWithOwner": "facebook/react", "stargazers": { "totalCount": 143121 } } },
        { "node": { "nameWithOwner": "tensorflow/tensorflow", "stargazers": { "totalCount": 140562 } } },
        { "node": { "nameWithOwner": "twbs/bootstrap", "stargazers": { "totalCount": 138369 } } },
        { "node": { "nameWithOwner": "EbookFoundation/free-programming-books", "stargazers": { "totalCount": 136421 } } },
        { "node": { "nameWithOwner": "sindresorhus/awesome", "stargazers": { "totalCount": 125160 } } },
        { "node": { "nameWithOwner": "getify/You-Dont-Know-JS", "stargazers": { "totalCount": 115851 } } },
        { "node": { "nameWithOwner": "ohmyzsh/ohmyzsh", "stargazers": { "totalCount": 102749 } } }
      ]
    }
  }
}

There are 704,279 repos, but I can request up to 100 repos per query and step through the results using a cursor, so it would seem that with enough time this would work. Unfortunately, the GitHub GraphQL API limits you to the first 1,000 results of any query, so this won't do.
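
In case it's useful, here's roughly what that cursor loop looks like in Python (just a sketch, assuming a personal access token in token; it still ends once the 1,000-result cap is hit):

import requests

QUERY = '''
query($cursor: String) {
  search(query: "is:public stars:>15", type: REPOSITORY, first: 100, after: $cursor) {
    pageInfo { endCursor hasNextPage }
    edges {
      node {
        ... on Repository {
          nameWithOwner
          stargazers { totalCount }
        }
      }
    }
  }
}
'''

def fetch_repos(token):
    headers = {'Authorization': f'bearer {token}'}
    repos, cursor = [], None
    while True:
        resp = requests.post('https://api.github.com/graphql',
                             json={'query': QUERY, 'variables': {'cursor': cursor}},
                             headers=headers)
        resp.raise_for_status()
        search = resp.json()['data']['search']
        repos.extend(edge['node'] for edge in search['edges'])
        if not search['pageInfo']['hasNextPage']:
            return repos  # in practice this ends around the first 1,000 results
        cursor = search['pageInfo']['endCursor']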

I can run multiple queries using ranges of stars (e.g. stars:1000..1500), but this breaks down once you get to repos with less starpower (there are over 1,000 repos with exactly 123 stars).

I could break the query down more ways (e.g. by date the repo was created) but this is starting to get crazy. Is there a simpler way to get a complete list of public GitHub repos with 15 or more stars?

Upvotes: 6

Views: 2200

Answers (2)

Ritav Das

Reputation: 11

import requests

GITHUB_API = "https://api.github.com/search/repositories"
TOKEN = "Your GitHub Token"

headers = {"Authorization": f"token {TOKEN}"}


def get_repositories_with_min_stars(min_stars):
    # Only the total_count is needed here, so a single page of results is enough.
    params = {
        "q": f"stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "page": 1,
        "per_page": 100,
    }
    response = requests.get(GITHUB_API, headers=headers, params=params)
    if response.status_code != 200:
        return 0

    return response.json()["total_count"]


repositories = get_repositories_with_min_stars(20)
print("Number of repositories with at least 20 stars:", repositories)

Upvotes: 0

danvk

Reputation: 16955

Splitting by both creation date and star range (the "crazy" solution mentioned in the question) works pretty well in practice.

You can use a GraphQL query like this to get just the count of repos with 15-20 stars that were created in a given date range:

query {
  search(query: "is:public stars:15..20 created:2016-01-01..2016-01-09", type: REPOSITORY, first: 1) {
    repositoryCount
  }
}

response:

{ "data": { "search": { "repositoryCount": 534 } } }

For a given star range (say 15–20), you start with a long date range (say 2007–2020) and get the result count. If it's above 1,000, you split the date range in two and get result counts for each. Keep splitting recursively until each star range/date interval is under 1,000 results.

Here's code to do it:

from datetime import datetime, timedelta

# get_count(q) returns the number of repos matching a search query; query_repos(q, out_file)
# fetches all of them and writes them to out_file, an open handle for the output CSV.
# A sketch of both helpers is below; the real versions are in the accompanying source code.

def split_interval(a, b):
    d = int((b - a) / 2)
    return [(a, a + d), (a + d + 1, b)]

def split_by_days(stars, day_start, day_end):
    start_fmt = day_start.strftime('%Y-%m-%d')
    end_fmt = day_end.strftime('%Y-%m-%d')
    q = f'stars:{stars} created:{start_fmt}..{end_fmt}'
    c = get_count(q)
    if c <= 1000:
        # Under the 1,000-result cap: fetch every repo matching this query.
        query_repos(q, out_file)
    else:
        # Over the cap: bisect the date range and recurse on each half.
        days = (day_end - day_start).days
        if days == 0:
            raise ValueError(f'Can\'t split any more: {stars} / {day_start} .. {day_end}')
        for a, b in split_interval(0, days):
            dt_a = day_start + timedelta(days=a)
            dt_b = day_start + timedelta(days=b)
            split_by_days(stars, dt_a, dt_b)

ranges = [
    (15, 20), (21, 25), (26, 30),
    # ...
    (1001, 1500), (1501, 5000), (5001, 1_000_000)
]
for a, b in ranges:
    stars = f'{a}..{b}'
    split_by_days(stars, datetime(2007, 1, 1), datetime(2020, 2, 2))
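
The get_count and query_repos helpers aren't shown above; the real implementations are in the source code linked below, but a minimal sketch against the GraphQL endpoint might look like this (YOUR_GITHUB_TOKEN is a placeholder, and out_file is assumed to be an open file handle for the CSV output):

import csv
import requests

GITHUB_GRAPHQL = 'https://api.github.com/graphql'
HEADERS = {'Authorization': 'bearer YOUR_GITHUB_TOKEN'}  # placeholder token

def run_search(graphql):
    resp = requests.post(GITHUB_GRAPHQL, json={'query': graphql}, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()['data']['search']

def get_count(q):
    # Just the number of matching repos, used to decide whether to keep splitting.
    search = run_search(f'''
      query {{
        search(query: "is:public {q}", type: REPOSITORY, first: 1) {{
          repositoryCount
        }}
      }}''')
    return search['repositoryCount']

def query_repos(q, out_file):
    # Page through all (at most 1,000) matches for this query and append them as CSV rows.
    writer = csv.writer(out_file)
    cursor = 'null'
    while True:
        search = run_search(f'''
          query {{
            search(query: "is:public {q}", type: REPOSITORY, first: 100, after: {cursor}) {{
              pageInfo {{ endCursor hasNextPage }}
              nodes {{ ... on Repository {{ nameWithOwner stargazers {{ totalCount }} }} }}
            }}
          }}''')
        for node in search['nodes']:
            writer.writerow([node['nameWithOwner'], node['stargazers']['totalCount']])
        if not search['pageInfo']['hasNextPage']:
            return
        cursor = f'"{search["pageInfo"]["endCursor"]}"'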

It's better to scrape from a low star range up since repos are more likely to gain stars than lose them during the scraping process.

For me this wound up requiring 1,102 distinct searches. Here's a CSV file (~50MB) gathered using this approach with all the repos that had 15+ stars on 2020-02-03. See this blog post and the accompanying source code for more details.

Upvotes: 3
