Reputation: 389
Sorry if this duplicates a previous question, but I couldn't find quite what I'm looking for. I'm in the process of converting a large CVS codebase (20+ repositories with 15 years of history, 10-15 GB in size) to Git. Much of the size is due to binaries that were committed along with the code in the past. While some of the binaries can be removed completely, it's desirable to keep many of them along with their history. However, we don't want the repo to bloat.
We are currently planning on using git-fat to store the binaries, and I'm writing a script to convert the files automatically. My first step is to identify all the files in the repo (including deleted files) that are binaries. Is there a simple approach to accomplishing this? Thanks for your help.
Edit
I actually think I found a reasonable approach where I just run
git log --numstat <first commit hash> HEAD
This prints a list of all the files with two columns in front; the first contains the number of changes to the file (I'm not sure if it's in bytes or lines). The important part is that for binary files it is '-'. By selecting lines with this marker and "uniqueing" them, I believe I get the complete list of binary files.
Are there any flaws with this strategy?
Upvotes: 17
Views: 6597
Reputation: 5320
TL;DR:
git log --all --numstat \
| grep '^-' \
| cut -f3 \
| sed -E 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' \
| sort -u
Explanation:
The git-log option --numstat shows number of added and deleted lines in decimal notation and pathname without abbreviation, to make it more machine friendly. For binary files, outputs two - instead of saying 0 0.
Source: https://git-scm.com/docs/git-log, emphasis mine
This produces output entries like the following:
commit 0123456789012345678901234567890123456789
Author: Joe Example <[email protected]>
Date: Thu Mar 9 15:33:29 2017 +0000
edit Dockerfile, add assets/foobar.jpg
1 1 Dockerfile
- - assets/foobar.jpg
The grep '^-' matches lines with a leading hyphen, the cut -f3 prints the third tab-delimited field, and the sed -E 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' detects files that have been moved/renamed and prints both the source and destination; e.g., it would change this:
path/to/{foo => bar}/my-document.pdf
to this:
path/to/foo/my-document.pdf
path/to/bar/my-document.pdf
Finally, the sort -u will accumulate, sort, and uniquify the list of paths.
EDIT: This answer assumes the existence of a sed that supports extended regular expressions and capture groups; e.g., https://www.gnu.org/software/sed/ .
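Where GNU sed is not available, the same filtering can be sketched with awk instead. This is only a rough alternative, not part of the original pipeline: the function name binary_paths is made up for illustration, and it assumes the default (non -z) --numstat output, so paths containing tabs or a literal " => " would be misread. It also splits the plain "old => new" rename form that the sed expression above leaves untouched.

```shell
# Hypothetical helper: reads `git log --numstat` output on stdin and prints
# each binary path once. Binary entries have "-" in both count columns.
binary_paths() {
  awk -F '\t' '
    $1 == "-" && $2 == "-" {
      path = $3
      if (match(path, /[{].* => .*[}]/)) {
        # Rename with a shared prefix/suffix: pre{old => new}post
        pre  = substr(path, 1, RSTART - 1)
        mid  = substr(path, RSTART + 1, RLENGTH - 2)
        post = substr(path, RSTART + RLENGTH)
        sep  = index(mid, " => ")
        print pre substr(mid, 1, sep - 1) post
        print pre substr(mid, sep + 4) post
      } else if ((sep = index(path, " => ")) > 0) {
        # Rename with nothing in common: old => new
        print substr(path, 1, sep - 1)
        print substr(path, sep + 4)
      } else {
        print path
      }
    }' |
    sed 's|//|/|g' |   # an empty side such as {old => } leaves a double slash
    LC_ALL=C sort -u
}

# Usage (inside a repository):
#   git log --all --numstat | binary_paths
```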
Upvotes: 15
Reputation: 485
One of the contributors to git-fat here.
If you're primarily concerned about the size of the files, and not specifically their type, then git-fat has a find command that lets you find all the files in the git repository over a given size.
I currently contribute to cyaninc's fork, but both versions (Jed's and Cyan's) have the find command.
Also check out the retroactive import section of the READMEs; both versions support that as well.
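If git-fat isn't installed yet, a size-based search can be approximated with plain git plumbing. The sketch below is not how git-fat's find is implemented; large_blobs is a hypothetical helper name, and it assumes paths contain no newlines. It filters the output of git cat-file --batch-check down to blobs above a byte threshold, largest first.

```shell
# Hypothetical helper: reads "<type> <hash> <size> <path>" lines on stdin
# and prints "<size>\t<path>" for every blob larger than $1 bytes.
large_blobs() {
  awk -v min="$1" '
    $1 == "blob" && $3 + 0 > min + 0 {
      size = $3
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")   # drop type, hash, size; keep the path
      print size "\t" $0
    }' |
    sort -k1,1nr
}

# Usage (inside a repository), listing blobs over 1 MiB in any revision:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     large_blobs 1048576
```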
Upvotes: 2
Reputation: 8305
One solution would be to iterate through all revisions, list the files in each revision, get the content of each file, and then determine each file's type.
Here is how you can get a list of all revisions:
$ git rev-list HEAD
32a9b9158d73dc80b355993a5a5f8fc49ae25334
9946574838bf5f984f5f4a19b2fc524f0a60378c
3f82a5dcecde0028da21fb266c1bbd7e9ec762ec
...
Here is how you can get a list of all files in a revision:
$ git ls-tree -r 32a9b9158d73dc80b355993a5a5f8fc49ae25334
100644 blob dcf290b1a99a8d2535b8aa8f85702cd1b7fac6e8 .gitignore
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 README
You can get the content of each file by passing its blob hash to git show:
$ git show dcf290b1a99a8d2535b8aa8f85702cd1b7fac6e8
.gitignore
*.pyc
rm_pyc.sh
aima/**/*.pyc
.idea
To test whether a file is binary, you can use /bin/file:
$ git show dcf290b1a99a8d2535b8aa8f85702cd1b7fac6e8 > file
$ /bin/file file
file: ASCII text
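Putting the steps together, a rough sketch could look like the following. It takes two liberties that are assumptions on my part rather than part of the steps above: it walks each object once with git rev-list --objects --all instead of calling git ls-tree for every revision, and it swaps /bin/file for a simple "NUL byte in the first 8000 bytes" check, which is close to the heuristic Git itself uses. The function names are made up for illustration, and paths with newlines are not handled.

```shell
# Succeeds (exit 0) when stdin contains a NUL byte in its first 8000 bytes.
is_binary() {
  head -c 8000 | od -An -v -tx1 | grep -qw 00
}

# Hypothetical driver: prints the path of every blob, in any revision,
# whose content looks binary. It spawns one git process per blob, so it
# will be slow on a large history.
list_binary_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(rest)' |
    while read -r type sha path; do
      [ "$type" = blob ] || continue
      if git cat-file blob "$sha" | is_binary; then
        printf '%s\n' "$path"
      fi
    done |
    LC_ALL=C sort -u
}

# Usage (inside a repository):
#   list_binary_blobs
```

To keep the original type detection, the is_binary call can be replaced with a check on the output of file.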
Upvotes: 1