megawac

Reputation: 11373

Git tree-filter run python script on commits

I asked this question on #git earlier, but as it's reasonably substantial I'll post it here. I want to run a filter-branch on a repo to modify (thousands of) files over hundreds of commits using a Python script. I'm calling the clean.py script using the following command in the repo directory:

git filter-branch -f --tree-filter '(cd ../cleaner/ && python clean.py --path=files/*/*/**)'

clean.py looks like this and modifies every file matching the path (i.e. files/*/*/**):

import argparse
import logging

import yaml

from cleaner import Cleaner

# Optional --path overrides the pattern from config.yml
parser = argparse.ArgumentParser()
parser.add_argument("--path", help="path to run cleaner on", type=str)
args = parser.parse_args()

# logging.basicConfig(level=logging.DEBUG)

# config.yml is resolved against the current working directory
with open("config.yml") as sets:
    config = yaml.safe_load(sets)

# Fall back to the configured pattern when --path is not given
path = args.path
if not path:
    path = config["cleaner"]["general_pattern"]

cleaner = Cleaner(config["cleaner"])

print("Cleaning path: " + str(path))
cleaner.clean(path, True)

After running the command, the following is printed to the terminal:

$ python deploy.py --verbose
INFO:root:Checked out master branch
INFO:root:Running command:
'git filter-branch -f --tree-filter '(cd C:/Users/Graeme/Documents/programming/clean-cdn/clean-jsdelivr/ && python clean.py --path=files/*/*/**)' -d "../tmp"' in ../jsdelivr
Rewrite 298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e (1/1535)
Cleaning path: files/*/*/**

C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 343: ../commit: No such file or directory
C:\Program Files (x86)\git/libexec/git-core\git-filter-branch: line 346: ../map/298ec3a2ca5877a25ebd40aeb815d7b5a5f33a7e
: No such file or directory
could not write rewritten commit
rm: cannot remove `/c/Users/Graeme/Documents/programming/clean-cdn/tmp/revs': Permission denied
rm: cannot remove directory `/c/Users/Graeme/Documents/programming/clean-cdn/tmp': Directory not empty

The Python script executes successfully and modifies the files correctly, but filter-branch fails to write the rewritten commit. It looks like a permissions issue, but running with elevated privileges has not helped. I've tried running the filter-branch on Windows 7, Windows 8, and Ubuntu with git v1.8 and v1.9.
Edit: The script works as is on CentOS with git 1.7.1.

The goal is to reduce the size of a CDN's repo (nearing 1GB) once the contents of files/*/*/** have finished syncing with a database.
The source code of the project
Target repo for the rewrite

Upvotes: 4

Views: 1529

Answers (3)

Roberto Tyley

Reputation: 25314

The permissions issue you're encountering is interesting. Are you doing this on a local copy of the repo (i.e. one where you have full access to the filesystem), or on a remote server?

Reading over your Python code, it looks like you're trying to remove every file over a certain size that is not a .INI file. Did I get that right?

If that's the case, can I ask if you've considered The BFG Repo-Cleaner? Obviously, you learn a lot about Git by writing your own code (I know I have), but I think The BFG is probably tailor-made for your needs, and it will be faster than any approach based on git-filter-branch.

In your case, you might want to run it with a command like:

$ java -jar bfg.jar --strip-blobs-bigger-than 100K my-repo.git

This removes all blobs bigger than 100K that aren't in your latest commit.

I did a quick run with this on the jsdelivr repo, and it reduced the pack size from 284M to 138M in the cleaned repo. The BFG cleaning step took under 5 seconds; the subsequent git gc --prune=now --aggressive took just under 2 minutes.
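If you want to compare the pack size before and after cleaning yourself, something like the following sketch may help. It assumes git is on the PATH and reads the size-pack field (reported in KiB) from git count-objects -v. The function names parse_count_objects and pack_size_kib are made up for illustration and are not part of git or The BFG.

```python
import subprocess


def parse_count_objects(output):
    """Turn `git count-objects -v` output into a dict of integer fields."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Skip lines without a numeric value (e.g. "alternate: <path>")
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def pack_size_kib(repo_dir):
    """Return the size-pack value (in KiB) that git reports for repo_dir."""
    output = subprocess.check_output(
        ["git", "count-objects", "-v"], cwd=repo_dir
    ).decode("utf-8")
    return parse_count_objects(output)["size-pack"]
```

Calling pack_size_kib on the repo once before cleaning and once after the git gc step gives the two numbers to compare.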

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Upvotes: 2

ash

Reputation: 101

Consider using BFG. It is much faster and simpler to use than git filter-branch.

Upvotes: 0

michas

Reputation: 26555

You should not cd to another directory, as the git-filter-branch script uses relative paths (such as the ../commit and ../map paths in your error output) to access its files.
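One possible way to drop the cd, sketched below under some assumptions: clean.py seems to need the cd only because it opens config.yml relative to the current directory, so it could look the file up next to itself instead and be called by its absolute path. Python already puts the script's own directory on sys.path, so the cleaner import should keep working. The helper names script_relative and tree_filter_command are hypothetical, and this assumes the files/*/*/** pattern is meant to be resolved against the tree that filter-branch checks out.

```python
import os


def script_relative(name, script_file):
    """Return the path of `name` next to `script_file`, independent of the cwd."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, name)


def tree_filter_command(script_path, pattern):
    """Build a --tree-filter string that runs the script without changing directory."""
    return 'python "{0}" --path="{1}"'.format(os.path.abspath(script_path), pattern)


# Inside clean.py the config would then be opened as (hypothetical usage):
#     with open(script_relative("config.yml", __file__)) as sets:
#         ...
```

The string from tree_filter_command would be passed as the --tree-filter argument in place of the (cd ... && python clean.py ...) subshell.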

Upvotes: 1
