Reputation: 8591
Sometimes, the original GitHub repository of a piece of software I'm using, such as linkchecker, is seeing little or no development, while a lot of forks have been created (in this case: 142, at the time of writing).
For each fork, I'd like to know:
and for each such branch:
GitHub has a web interface for comparing forks, but I don't want to do this manually for each fork, I just want a CSV file with the results for all forks. How can this be scripted? The GitHub API can list the forks, but I can't see how to compare forks with it. Cloning every fork in turn and doing the comparison locally seems a bit crude.
Upvotes: 50
Views: 7460
Reputation: 2428
Update: Using the following bookmarklet twice (once to navigate to the forks tree, and once to analyze the forks) prints the ahead/behind info for each fork directly onto the web page.
Add a new bookmark and paste all the following code as the URL. On Chrome, you can select the code and drag and drop it into your bookmarks list.
javascript:(async () => {
  /* Source: github.com/coezbek/github-fork-bookmarklet */

  /* Fetch the default branch of a repository */
  const getDefaultBranch = async (user, repo) => {
    const response = await fetch(`https://api.github.com/repos/${user}/${repo}`);
    if (!response.ok) throw new Error('Failed to retrieve repository information.');
    return (await response.json()).default_branch;
  };

  /* Fetch branch information for a given user, repo and branch */
  const getBranchInfo = async (user, repo, branch) => {
    try {
      const response = await fetch(`https://github.com/${user}/${repo}/branch-infobar/${branch}`, { headers: { accept: 'application/json' } });
      return response.ok ? (await response.json()).refComparison : null;
    } catch (error) {
      console.error(`Error fetching branch info for ${user}/${repo}:`, error);
      return null;
    }
  };

  try {
    /* Ensure the script runs on a GitHub repository page */
    const match = window.location.href.match(/^https:\/\/github\.com\/([^/]+)\/([^/]+)(\/network\/members\/?)?/);
    if (!match) {
      alert('Run this from a GitHub repository page.');
      return;
    }
    if (!match[3]) {
      /* First click: navigate to the forks tree, then stop; the second click does the analysis */
      window.location.href = `https://github.com/${match[1]}/${match[2]}/network/members`;
      return;
    }
    const [_, mainUser, mainRepo] = match;
    const defaultBranch = await getDefaultBranch(mainUser, mainRepo);
    /* Collect all fork links, excluding the original repo */
    const forkLinks = [...document.querySelectorAll('div.repo a:last-of-type')].slice(1);
    for (const link of forkLinks) {
      try {
        /* Extract user and repository name from the link */
        const [_, user, repo] = link.href.match(/github\.com\/([^/]+)\/([^/]+)/) || [];
        if (!user || !repo) continue;
        /* Attempt to fetch branch information, falling back to the fork's own default branch */
        let branchInfo = await getBranchInfo(user, repo, defaultBranch);
        if (!branchInfo) branchInfo = await getBranchInfo(user, repo, await getDefaultBranch(user, repo));
        /* Check downstream forks for modifications */
        const childLinks = [...link.closest('.repo').querySelectorAll('.network-tree + a')];
        const childrenHaveMods = await Promise.all(childLinks.map(async (childLink) => {
          const [_, childUser, childRepo] = childLink.href.match(/github\.com\/([^/]+)\/([^/]+)/) || [];
          if (childUser && childRepo) {
            const childBranchInfo = await getBranchInfo(childUser, childRepo, defaultBranch);
            return childBranchInfo && childBranchInfo.ahead > 0;
          }
          return false;
        })).then(results => results.some(Boolean));
        /* Hide unmodified forks; annotate the rest with ahead/behind counts */
        if (branchInfo) {
          const { ahead, behind } = branchInfo;
          if (ahead === 0 && !childrenHaveMods) {
            link.closest('.repo').style.display = 'none';
          } else {
            const branchDetails = `Ahead: <font color="#0c0">${ahead}</font>, Behind: <font color="red">${behind}</font>`;
            link.insertAdjacentHTML('afterend', ` - ${branchDetails}`);
          }
        }
      } catch (error) {
        console.error(`Error processing link: ${link.href}`, error);
      }
    }
  } catch (error) {
    console.error('Error in bookmarklet execution:', error);
  }
})();
Alternatively, you can paste the code into the address bar, but note that some browsers strip the leading javascript: while pasting, so you'll have to type javascript: yourself. (Or copy everything except the leading j, type j, and paste the rest.)
Alternatively, you can paste the code into the DevTools console.
It has been modified from this answer.
Upvotes: 76
Reputation: 477
Here's a Python script using the GitHub API. I wanted to include the date and the last commit message. You'll need to supply a Personal Access Token (PAT) if you need the rate limit bumped to 5,000 requests/hour.
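(A side note, not part of the script below: you can check how much of your quota is left via the /rate_limit endpoint, which doesn't count against the limit itself. A minimal sketch using only the standard library; the function names are mine.)

```python
import json
import urllib.request

def parse_remaining(rate_limit_json):
    """Extract the remaining core-API quota from a /rate_limit response body."""
    return json.loads(rate_limit_json)['resources']['core']['remaining']

def remaining_core_requests(pat=None):
    # /rate_limit reports the current quota and does not count against it.
    headers = {'Authorization': 'token ' + pat} if pat else {}
    request = urllib.request.Request('https://api.github.com/rate_limit', headers=headers)
    with urllib.request.urlopen(request) as response:
        return parse_remaining(response.read())
```

Call remaining_core_requests() without arguments to see the unauthenticated quota, or pass your PAT to see the authenticated one.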
USAGE: python3 list-forks.py https://github.com/itinance/react-native-fs
Example Output:
https://github.com/itinance/react-native-fs root 2021-11-04 "Merge pull request #1016 from mjgallag/make-react-native-windows-peer-dependency-optional make react-native-windows peer dependency optional"
https://github.com/AnimoApps/react-native-fs diverged +2 -160 [+1m 10d] "Improved comments to align with new PNG support in copyAssetsFileIOS"
https://github.com/twinedo/react-native-fs ahead +1 [+26d] "clear warn yellow new NativeEventEmitter()"
https://github.com/synonymdev/react-native-fs ahead +2 [+23d] "Merge pull request #1 from synonymdev/event-emitter-fix Event Emitter Fix"
https://github.com/kongyes/react-native-fs ahead +2 [+10d] "aa"
https://github.com/kamiky/react-native-fs diverged +1 -2 [-6d] "add copyCurrentAssetsVideoIOS function to retrieve current modified videos"
https://github.com/nikola166/react-native-fs diverged +1 -2 [-7d] "version"
https://github.com/morph3ux/react-native-fs diverged +1 -4 [-30d] "Update package.json"
https://github.com/broganm/react-native-fs diverged +2 -4 [-1m 7d] "Update RNFSManager.m"
https://github.com/k1mmm/react-native-fs diverged +1 -4 [-1m 14d] "Invalidate upload session Prevent memory leaks"
https://github.com/TickKleiner/react-native-fs diverged +1 -4 [-1m 24d] "addListener and removeListeners methods wass added to pass warning"
https://github.com/nerdyfactory/react-native-fs diverged +1 -8 [-2m 14d] "fix: applying change from https://github.com/itinance/react-native-fs/pull/944"
import requests, re, os, sys, time, json, datetime
from dateutil.relativedelta import relativedelta
from urllib.parse import urlparse

GITHUB_PAT = 'ghp_abcdef123456789'  # replace with your own token

def json_from_url(url):
    response = requests.get(url, headers={'Authorization': 'token {}'.format(GITHUB_PAT)})
    return response.json()

def date_delta_to_text(date1, date2) -> str:
    ret = []
    date_delta = relativedelta(date2, date1)
    sign = '+' if date1 < date2 else '-'
    if date_delta.years != 0:
        ret.append('{}y'.format(abs(date_delta.years)))
    if date_delta.months != 0:
        ret.append('{}m'.format(abs(date_delta.months)))
    if date_delta.days != 0:
        ret.append('{}d'.format(abs(date_delta.days)))
    if date_delta.years == 0 and date_delta.months == 0 and date_delta.days == 0:
        sign = ''
        ret.append('0d')
    return '{}{}'.format(sign, ' '.join(ret))

def iso8601_date_to_date(date):
    return datetime.datetime.strptime(date, '%Y-%m-%dT%H:%M:%SZ')

def date_to_text(date):
    return date.strftime('%Y-%m-%d')

def process_repo(repo_author, repo_name, branch_name, fork_of_fork):
    page = 1
    while True:
        forks_url = 'https://api.github.com/repos/{}/{}/forks?per_page=100&page={}'.format(repo_author, repo_name, page)
        forks_json = json_from_url(forks_url)
        if not forks_json:
            break
        for fork_info in forks_json:
            fork_author = fork_info['owner']['login']
            fork_name = fork_info['name']
            forks_count = fork_info['forks_count']
            fork_url = 'https://github.com/{}/{}'.format(fork_author, fork_name)
            compare_url = 'https://api.github.com/repos/{}/{}/compare/{}...{}:{}'.format(repo_author, repo_name, branch_name, fork_author, branch_name)
            compare_json = json_from_url(compare_url)
            if 'status' in compare_json:
                items = []
                status = compare_json['status']
                ahead_by = compare_json['ahead_by']
                behind_by = compare_json['behind_by']
                total_commits = compare_json['total_commits']
                commits = compare_json['commits']
                if fork_of_fork:
                    items.append(' ')
                items.append(fork_url)
                items.append(status)
                if ahead_by != 0:
                    items.append('+{}'.format(ahead_by))
                if behind_by != 0:
                    items.append('-{}'.format(behind_by))
                if total_commits > 0:
                    last_commit = commits[total_commits - 1]
                    commit = last_commit['commit']
                    author = commit['author']
                    date = iso8601_date_to_date(author['date'])
                    items.append('[{}]'.format(date_delta_to_text(root_date, date)))
                    items.append('"{}"'.format(commit['message'].replace('\n', ' ')))
                if ahead_by > 0:
                    print(' '.join(items))
            if forks_count > 0:
                process_repo(fork_author, fork_name, branch_name, True)
        page += 1

def get_commits_json(root_author, root_name, branch_name):
    commits_url = 'https://api.github.com/repos/{}/{}/commits/{}'.format(root_author, root_name, branch_name)
    return json_from_url(commits_url)

url_parsed = urlparse(sys.argv[1].strip())
path_array = url_parsed.path.split('/')
root_author = path_array[1]
root_name = path_array[2]
branch_name = 'master'
root_url = 'https://github.com/{}/{}'.format(root_author, root_name)
commits_json = get_commits_json(root_author, root_name, branch_name)
if commits_json.get('message') == 'No commit found for SHA: master':
    branch_name = 'main'
    commits_json = get_commits_json(root_author, root_name, branch_name)
commit = commits_json['commit']
author = commit['author']
root_date = iso8601_date_to_date(author['date'])
print('{} root {} "{}"'.format(root_url, date_to_text(root_date), commit['message'].replace('\n', ' ')))
process_repo(root_author, root_name, branch_name, False)
Upvotes: 1
Reputation: 5229
useful-forks is an online tool which filters the forks based on whether they are ahead of the original. I think it answers your needs quite well. :)
For the repo in your question, you could do: https://useful-forks.github.io/?repo=wummel/linkchecker
That should provide you with results similar to the other answers here (as of 2022-04-02).
There is also a Chrome extension; download it here: https://chrome.google.com/webstore/detail/useful-forks/aflbdmaojedofngiigjpnlabhginodbf
Add this as the URL of a new bookmark, and click that bookmark when you're on a repo:
javascript:!function(){if(m=window.location.href.match(/github\.com\/([\w.-]+)\/([\w.-]+)/),m){window.open(`https://useful-forks.github.io/?repo=${m[1]}/${m[2]}`)}else window.alert("Not a GitHub repo")}();
Although to be honest, it's a better option to simply get the Chrome Extension, if you can.
I am the maintainer of this project.
Upvotes: 27
Reputation: 2428
Here's a Python script for listing and cloning all forks that are ahead.
It doesn't use the API, so it doesn't suffer from a rate limit and doesn't require authentication, but it might require adjustments if the GitHub website design changes.
Unlike the bookmarklet in the other answer that shows links to ZIP files, this script also saves info about the commits, because it uses git clone and also creates a commits.htm file with the overview.
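The scraping hinges on matching GitHub's "N commits ahead[, M commits behind]" phrase in the page HTML. In isolation, that step looks roughly like this (the helper name is mine):

```python
import re

def parse_aheadness(fork_page_html):
    """Return (ahead, behind) parsed from a fork's page HTML, or None if absent."""
    match = re.search(r'([0-9]+) commits? ahead(?:, ([0-9]+) commits? behind)?', fork_page_html)
    if not match:
        return None
    ahead = int(match.group(1))
    behind = int(match.group(2)) if match.group(2) else 0
    return ahead, behind
```

For example, parse_aheadness('This branch is 2 commits ahead, 4 commits behind cifkao:master.') returns (2, 4), and a page saying "even with" yields None.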
import requests, re, os, sys, time

def content_from_url(url):
    # TODO handle internet being off and stuff
    return requests.get(url).content

ENCODING = "utf-8"

def clone_ahead_forks(forklist_url):
    forklist_htm = content_from_url(forklist_url).decode(ENCODING)
    with open("forklist.htm", "w", encoding=ENCODING) as text_file:
        text_file.write(forklist_htm)
    is_root = True
    # not working if there are no forks: '<a class="(Link--secondary)?" href="(/([^/"]*)/[^/"]*)">'
    for match in re.finditer('<a (class=""|data-pjax="#js-repo-pjax-container") href="(/([^/"]*)/[^/"]*)">', forklist_htm):
        fork_url = 'https://github.com' + match.group(2)
        fork_owner_login = match.group(3)
        fork_htm = content_from_url(fork_url).decode(ENCODING)
        match2 = re.search('([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', fork_htm)
        # TODO check whether 'ahead'/'behind'/'even with' appear only once on the entire page -
        # in that case they are not part of the readme, "About" box, etc.
        sys.stdout.write('.')
        if match2 or is_root:
            if match2:
                aheadness = match2.group(1)  # for example '1 commit ahead, 2 commits behind'
            else:
                aheadness = 'root repo'
                is_root = False  # for subsequent iterations
            dir = fork_owner_login + ' (' + aheadness + ')'
            print(dir)
            if not os.path.exists(dir):
                os.mkdir(dir)
                os.chdir(dir)
                # save commits.htm
                commits_htm = content_from_url(fork_url + '/commits').decode(ENCODING)
                with open("commits.htm", "w", encoding=ENCODING) as text_file:
                    text_file.write(commits_htm)
                # git clone
                os.system('git clone ' + fork_url + '.git')
                print()
                # no need to recurse into forks of forks because they are all listed
                # on the initial page and being traversed already
                os.chdir('..')
            else:
                print(dir + ' already exists, skipping.')

base_path = os.getcwd()
match_disk_letter = re.search(r'^([a-zA-Z]:\\)', base_path)
with open('repo_urls.txt') as url_file:
    for url in url_file:
        url = url.strip()
        url = re.sub(r'\?[^/]*$', '', url)  # remove strings like '?utm_source=...' from the end
        print(url)
        match = re.search('github.com/([^/]*)/([^/]*)$', url)
        if match:
            user_name = match.group(1)
            repo_name = match.group(2)
            print(repo_name)
            dirname_for_forks = repo_name + ' (' + user_name + ')'
            if not os.path.exists(dirname_for_forks):
                url += "/network/members"  # page that lists the forks
                TMP_DIR = 'tmp_' + time.strftime("%Y%m%d-%H%M%S")
                if match_disk_letter:
                    # On Windows (path starts with A:\ or so), run git in A:\tmp_...
                    # instead of .\tmp_..., to prevent "filename too long" errors.
                    TMP_DIR = match_disk_letter.group(1) + TMP_DIR
                print(TMP_DIR)
                os.mkdir(TMP_DIR)
                os.chdir(TMP_DIR)
                clone_ahead_forks(url)
                print()
                os.chdir(base_path)
                os.rename(TMP_DIR, dirname_for_forks)
            else:
                print(dirname_for_forks + ' ALREADY EXISTS, SKIPPING.')
print('DONE.')
If you make the file repo_urls.txt with the following content (you can put several URLs, one URL per line):
https://github.com/cifkao/tonnetz-viz
then you'll get the following directories each of which contains the respective cloned repo:
tonnetz-viz (cifkao)
bakaiadam (2 commits ahead)
chumo (2 commits ahead, 4 commits behind)
cifkao (root repo)
codedot (76 commits ahead, 27 commits behind)
k-hatano (41 commits ahead)
shimafuri (11 commits ahead, 8 commits behind)
If it doesn't work (for example, because GitHub changed its page layout), try earlier revisions of this answer.
Upvotes: 0
Reputation: 2428
Here's a Python script for listing and cloning the forks that are ahead. This script partially uses the API, so it can hit the rate limit. (You can raise the limit, though not remove it, by adding GitHub API authentication to the script; please edit this answer or post that.)
Initially I tried to use the API entirely, but that hit the rate limit too quickly, so now I use is_fork_ahead_HTML instead of is_fork_ahead_API. This might require adjustments if the GitHub website design changes.
Due to the rate limit, I prefer the other answers that I posted here.
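The authentication asked for above would look roughly like this - a sketch, using the standard library rather than requests; the helper names are mine and the token placeholder must be replaced:

```python
import json
import urllib.request

GITHUB_PAT = ''  # paste a personal access token from github.com/settings/tokens

def auth_headers(pat):
    # GitHub expects 'Authorization: token <PAT>'; authenticated requests
    # get 5,000 requests/hour instead of 60.
    return {'Authorization': 'token ' + pat} if pat else {}

def obj_from_json_from_url_authenticated(url):
    # Same return shape as obj_from_json_from_url in the script below.
    request = urllib.request.Request(url, headers=auth_headers(GITHUB_PAT))
    with urllib.request.urlopen(request) as response:
        text = response.read()
    return json.loads(text), text
```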
import requests, json, os, re

def obj_from_json_from_url(url):
    # TODO handle internet being off and stuff
    text = requests.get(url).content
    obj = json.loads(text)
    return obj, text

def is_fork_ahead_API(fork, user, repo, default_branch_of_parent):
    """Use the GitHub API to check whether `fork` is ahead.
    This triggers the rate limit, so prefer the non-API version below instead.
    """
    # Compare default branch of original repo with default branch of fork.
    comparison, comparison_json = obj_from_json_from_url(
        'https://api.github.com/repos/' + user + '/' + repo + '/compare/'
        + default_branch_of_parent + '...' + fork['owner']['login'] + ':' + fork['default_branch'])
    if comparison['ahead_by'] > 0:
        return comparison_json
    else:
        return False

def is_fork_ahead_HTML(fork):
    """Use the GitHub website to check whether `fork` is ahead."""
    htm = requests.get(fork['html_url']).content.decode('utf-8')
    match = re.search('<div class="d-flex flex-auto">[^<]*?([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', htm)
    # TODO if website design changes, fall back onto checking whether 'ahead'/'behind'/'even with'
    # appear only once on the entire page - in that case they are not part of the username etc.
    if match:
        return match.group(1)  # for example '1 commit ahead, 114 commits behind'
    else:
        return False

def clone_ahead_forks(user, repo):
    obj, _ = obj_from_json_from_url('https://api.github.com/repos/' + user + '/' + repo)
    default_branch_of_parent = obj["default_branch"]
    page = 0
    forks = None
    while forks != []:  # the API returns an empty list once we run out of pages
        page += 1
        forks, _ = obj_from_json_from_url('https://api.github.com/repos/' + user + '/' + repo + '/forks?per_page=100&page=' + str(page))
        for fork in forks:
            aheadness = is_fork_ahead_HTML(fork)
            if aheadness:
                # dir = fork['owner']['login']+' ('+str(comparison['ahead_by'])+' commits ahead, '+str(comparison['behind_by'])+' commits behind)'
                dir = fork['owner']['login'] + ' (' + aheadness + ')'
                print(dir)
                os.mkdir(dir)
                os.chdir(dir)
                os.system('git clone ' + fork['clone_url'])
                print()
                # recurse into forks of forks
                if fork['forks_count'] > 0:
                    clone_ahead_forks(fork['owner']['login'], fork['name'])
                os.chdir('..')

user = 'cifkao'
repo = 'tonnetz-viz'
clone_ahead_forks(user, repo)
Upvotes: 0
Reputation: 186
Late to the party - I think this is the second time I've ended up on this SO post, so I'll share my JS-based solution (I ended up making a bookmarklet by just fetching and searching the HTML pages). You can either create a bookmarklet from this or simply paste the whole thing into the console. Works on Chromium-based browsers and Firefox:
EDIT: if there are more than 10 or so forks on the page, you may get locked out for scraping too fast (429 Too Many Requests in the network tab). Use async / await instead:
javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);
  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    await fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();
or you can do batches, but it's pretty easy to get locked out
javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);
  const getfork = (fork) => {
    return fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  };
  while (forks.length) {
    /* fetch two at a time to stay under the rate limit */
    await Promise.all(forks.splice(0, 2).map(getfork));
  }
})();
Original (this fires all requests at once and will possibly lock you out if that's more requests per second than GitHub allows):
javascript:(() => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);
  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();
Will print something like:
https://github.com/user1/repo: 289 commits behind original:master.
https://github.com/user2/repo: 489 commits behind original:master.
https://github.com/user2/repo: 1 commit ahead, 501 commits behind original:master.
...
to console.
EDIT: replaced comments with block comments for paste-ability
Upvotes: 3
Reputation: 8591
active-forks doesn't quite do what I want, but it comes close and is very easy to use.
Upvotes: 2
Reputation: 214
Had exactly the same itch and wrote a scraper that takes the info printed in the rendered HTML for forks: https://github.com/hbbio/forkizard
Definitely not perfect, but a temporary solution.
Upvotes: 10