Reputation: 11373
I asked this question on #git earlier, but as it's reasonably substantial I'll post it here. I want to run a filter-branch on a repo to modify (thousands of) files over hundreds of commits using a Python script. I'm calling the clean.py script with the following command in the repo directory:
git filter-branch -f --tree-filter '(cd ../cleaner/ && python clean.py --path=files/*/*/**)'
clean.py looks like this and will modify all files matching the path (i.e. files/*/*/**):
from os import environ as environment
import argparse, yaml
import logging
from cleaner import Cleaner

parser = argparse.ArgumentParser()
parser.add_argument("--path", help="path to run cleaner on", type=str)
args = parser.parse_args()

# logging.basicConfig(level=logging.DEBUG)

with open("config.yml") as sets:
    config = yaml.load(sets)

path = args.path
if not path:
    path = config["cleaner"]["general_pattern"]

cleaner = Cleaner(config["cleaner"])
print "Cleaning path: " + str(path)
cleaner.clean(path, True)
After running the command, the following is printed to the terminal:
$ python deploy.py --verbose
INFO:root:Checked out master branch
INFO:root:Running command:
'git filter-branch -f --tree-filter '(cd C:/Users/Graeme/Documents/programming/clean-cdn/clean-jsdelivr/ && python clean.py --path=files/*/*/**)' -d "../tmp"' in ../jsdelivr
Rewrite 298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e (1/1535)
Cleaning path: files/*/*/**
C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 343: ../commit: No such file or directory
C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 346: ../map/298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e: No such file or directory
could not write rewritten commit
rm: cannot remove `/c/Users/Graeme/Documents/programming/clean-cdn/tmp/revs': Permission denied
rm: cannot remove directory `/c/Users/Graeme/Documents/programming/clean-cdn/tmp': Directory not empty
The Python script executes successfully and modifies the files correctly, but filter-branch doesn't finish fixing up the commit. There appears to be a permissions issue, but I haven't been able to get around it by running with elevated privileges. I've tried running the filter-branch on Win7, Win8, and Ubuntu with git v1.8 and v1.9.
Edit: The script works as-is on CentOS with git 1.7.1.
The goal is to reduce the size of a CDN's repo (nearing 1 GB) after the contents of files/*/*/** finish syncing with a database.
The source code of the project
Target repo for the rewrite
Upvotes: 4
Views: 1529
Reputation: 25314
The permissions issue you're encountering is interesting. Are you doing this on a local copy of the repo (i.e. one where you have full access to the filesystem), or on a remote server?
Reading over your Python code, it looks like you're trying to remove every file over a certain size that is not a .INI file. Did I get that right?
If that's the case, can I ask if you've considered The BFG Repo-Cleaner? Obviously, you learn a lot about Git by writing your own code (I know I have), but I think The BFG is probably tailor-made for your needs and will be faster than any git-filter-branch-based approach.
In your case, you might want to run it with a command like:
$ java -jar bfg.jar --strip-blobs-bigger-than 100K my-repo.git
This removes all blobs bigger than 100K that aren't in your latest commit.
I did a quick run with this on the jsdelivr repo, and reduced the pack size from 284M to 138M in the cleaned repo. The BFG cleaning step took under 5 seconds, and the subsequent git gc --prune=now --aggressive took just under 2 minutes.
Full disclosure: I'm the author of the BFG Repo-Cleaner.
Upvotes: 2
Reputation: 26555
You should not cd to another directory, as the git-filter-branch script will use relative paths to access the files.
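One way to avoid the cd, as a rough sketch only: have clean.py locate config.yml relative to its own file instead of the current working directory, so the script can be called by its full path while the tree-filter stays in the checked-out tree. The helper name resolve_config_path is hypothetical (it is not part of the original script), and this assumes config.yml sits next to clean.py.

```python
import os


def resolve_config_path(script_file, config_name="config.yml"):
    """Return the path of config_name located next to script_file.

    Hypothetical helper: it assumes the config file lives in the same
    directory as the script, so the result does not depend on the
    directory the script is launched from.
    """
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, config_name)


# Inside clean.py this would replace the bare open("config.yml"):
#     with open(resolve_config_path(__file__)) as sets:
#         ...
```

With something like this in place, the filter could call the script by an absolute path rather than changing into its directory first. Note that the files/*/*/** pattern would then be resolved against the tree being rewritten, which appears to be the intent.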
Upvotes: 1