adonis

Reputation: 33

How to extract the total number of commit pages for a GitHub repository

I'm setting up a script to export all commits and pull requests for a larger list of GitHub repositories (about 4,000).

Now that the basic idea of the script works, I need a way to loop through all pages of commits for a repository.

I found out that I can export 100 commits per page. Some repos have many more commits (like 8,000), which would mean 80 pages to loop through.

I can't find a way to extract the number of pages from the GitHub API.

What I've done so far is set up the script so that it loops through all commits and exports them to a txt/csv file.

What I need is the total number of pages before I start looping through the commits of a repo.

The following gives me the number of pages, but not in a form I can use:

curl -u "user:password" -I "https://api.github.com/repos/0chain/rocksdb/commits?per_page=100"

RESULT:

Link: <https://api.github.com/repositories/152923130/commits?per_page=100&page=2>; rel="next", <https://api.github.com/repositories/152923130/commits?per_page=100&page=75>; rel="last"

I need the value 75 (or whatever value other repos return) as a variable in a loop.

Like so:

repolist=$(cat repolist.txt)
repolistarray=($repolist)
repolength=${#repolistarray[@]}

for (( i = 0; i < repolength; i++ )); do
    # here I need to extract the page number
    pagenumber=$(curl -u "user:password" -I "https://api.github.com/repos/${repolistarray[i]}/commits?per_page=100")

    for (( n = 1; n <= pagenumber; n++ )); do
        curl -u "user:password" -s "https://api.github.com/repos/${repolistarray[i]}/commits?per_page=100&page=$n" > committest.txt
    done
done

How can I get the "75" (or whatever the value is for another repo) out of this

Link: <https://api.github.com/repositories/152923130/commits?per_page=100&page=2>; rel="next", <https://api.github.com/repositories/152923130/commits?per_page=100&page=75>; rel="last"

to be used as "n"?

Upvotes: 2

Views: 1365

Answers (3)

luckman212

Reputation: 804

The official GitHub CLI (gh) supports a --paginate flag that does the heavy lifting for you. Combined with jq, you can get the answers you're looking for.

This is simpler and should be more robust than the other Bash solutions posted earlier.

Examples

Total number of commits in the last 90 days:

gh api --paginate \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"/repos/sindresorhus/awesome/commits?since=$(date -I -v-90d)&per_page=100" |
jq length
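(Note: date -I -v-90d uses BSD/macOS date syntax; with GNU coreutils you would write $(date -I -d '-90 days') instead.)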

Number of commits for the last 6 months, broken down by month, as CSV:

gh api --paginate \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"/repos/sindresorhus/awesome/commits?since=$(date -I -v-6m)&per_page=100" |
jq -r 'map(. + {month: (.commit.committer.date[:7])}) |
group_by(.month)[] | [(.[0].month), length] | @csv'

Output:

"2023-01",1
"2023-02",6
"2023-03",3
"2023-04",5
"2023-05",3
"2023-06",11

Upvotes: 0

webb

Reputation: 4340

Here is something along the lines of what @Poshi commented: loop indefinitely, requesting the next page until you hit an empty page, then break out of the inner loop and move on to the next repo.

# this is the contents of a page past the last real page:
emptypage='[

]'

# here's a simpler way to iterate over each repo (one per line) than using a bash array
cat repolist.txt | while read -r repo; do

  # loop indefinitely
  page=0
  while true; do
    page=$((page + 1))

    # minor improvement: use a variable, not a file.
    # also, you don't need to echo variables, just use them
    result=$(curl -u "user:password" -s \
      "https://api.github.com/repos/$repo/commits?per_page=100&page=$page")

    # if the result is empty, break out of the inner loop
    [ "$result" = "$emptypage" ] && break

    echo "$result" > committest.txt
    # note that > overwrites (whereas >> appends),
    # so committest.txt will be overwritten with each new page.
    #
    # in the final version, you probably want to process the results here,
    # and then
    #
    #       echo "$processed_results"
    #     done > repo1.txt
    #   done
    #
    # to output once per repo, or
    #
    #       echo "$processed_results"
    #     done
    #   done > all_results.txt
    #
    # to output all results to a single file

  done
done
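Note that this approach costs one extra request per repo (the empty page that terminates the loop), but it avoids having to parse the Link header at all.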

Upvotes: 1

Poshi

Reputation: 5762

Well, the method you ask for is not the most common one; usually this is done by fetching pages until no more data is available. But to answer your specific question, we must parse the line that contains the information. A quick-and-dirty way to do this could be:

response="Link: <https://api.github.com/repositories/152923130/commits?per_page=100&page=2>; rel=\"next\", <https://api.github.com/repositories/152923130/commits?per_page=100&page=75>; rel=\"last\""

<<< "$response" cut -f2- -d: | # First, get the contents of "Link": everything after the first colon
tr "," $'\n' |      # Separate the different parts in different lines
grep 'rel="last"' | # Select the line with last page information
cut -f1 -d';' |     # Keep only the URL
tr "?&" $'\n' |     # Split URL and its parameters, one per line
grep -e "^page" |   # Select the "page" parameter
cut -f2 -d=         # Finally, extract the number we are interested in
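Run against the header above, this pipeline prints 75; capture it with pagenumber=$(...) and you have the upper bound for your inner loop.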

There are other ways to do this, with fewer commands and maybe simpler, but this one lets me go step by step with the explanation. One of those other ways could be:

<<< "$response" sed 's/.*&page=\(.*\); rel="last".*/\1/'

This one makes some assumptions, such as that the page will always be the last URL parameter.
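If you would rather not rely on that, a slightly more defensive variant (still just a sketch) anchors on the parameter delimiter and tolerates trailing characters before the semicolon, so it also works when page is not the last parameter:

<<< "$response" sed 's/.*[?&]page=\([0-9]*\)[^;]*; rel="last".*/\1/'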

Upvotes: 0
