Cœur

Reputation: 38737

clean git history of deleted files, keeping renamed files history

I'd like to extract some files to a new repo, keeping their history, including file renames.

The best and closest answer I could find was new-repo-with-copied-history-of-only-currently-tracked-files, which uses git filter-branch --index-filter. It successfully keeps the history of existing files, but it doesn't preserve the history of renamed files.

(Another answer I could find used git filter-branch --subdirectory-filter. But it has two issues: it doesn't seem to work for the whole repo (folder '.'), and it doesn't preserve the history of renamed files.)

(Yet another answer used git subtree, but it doesn't keep history at all.)

So I'm probably looking for a way to improve the git ls-files > keep-these.txt command from the closest answer so that it also lists all previous file names. Maybe a script?

Upvotes: 2

Views: 524

Answers (1)

torek

Reputation: 489998

Git doesn't store file name changes.

Each commit stores a complete tree, e.g., perhaps commit 1234567... has files README and foo.txt and commit fedcba9... has files readme.txt and foo. If you ask git to compare commit 1234567 to commit fedcba9, and README is sufficiently similar[1] to readme.txt, git will say that the way to transform the one commit to the other is to rename the file. (If the one commit is the parent of the other, git show of the child commit will show the rename, because git show computes this change at git show time.)

On the other hand, if readme.txt is too different from README, but README is sufficiently similar to foo, git will say that the way to change 1234567 into fedcba9 is to rename README to foo.

The key is that git computes that when you ask for the comparison, and not a moment earlier. There's nothing in between the commits that says "rename some files". Git simply compares the commits and decides then whether the files are similar enough.
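To see this for yourself, here is a minimal sketch that builds a throwaway repository and compares the same two commits with and without rename detection. The temporary directory, the file contents, and the identity are placeholders made up for the demo; nothing here comes from your repository.

```shell
#!/bin/sh
# Throwaway repository: one commit adds README, the next moves it to readme.txt.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

printf 'line one\nline two\nline three\n' > README
git add README
git commit -q -m "add README"

git mv README readme.txt
git commit -q -m "move README"

# Same pair of commits, two different answers, depending on what we ask for.
with_detection=$(git diff -M --name-status HEAD~1 HEAD)
without_detection=$(git diff --no-renames --name-status HEAD~1 HEAD)

printf 'with -M:\n%s\n' "$with_detection"
printf 'with --no-renames:\n%s\n' "$without_detection"
```

With -M the comparison reports a single rename; with --no-renames the very same commits are reported as one deleted file plus one added file. Nothing in the commits changed between the two runs.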

For your purposes, what this ultimately means is that for each commit in your sequence-of-commits-to-copy-or-partially-copy, you'll have to decide which path names to keep and which to discard. How to achieve that is mostly up to you. The git log command does have a --follow flag to activate a limited amount of rename detection as it works backwards from child commits to their parents, and git blame automatically tries to do the same; you could use these (one path name at a time) to come up with a mapping of the form:

      in:   commits A..B    C..D             E..F
use path:   dir/file.ext    dir/frill.txt    lib/frill.next

for instance. But there's nothing built in to do this, and it won't be particularly easy. I'd start by combining git log --follow with --raw or --name-status output and seeing if there are any interesting Renames detected. If and when there are, those are the commit boundaries at which you'll want to change which paths you're keeping and discarding as you work through commits (whether that's with filter-branch or some other method).
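As a rough sketch of how such a mapping could be derived: the function below (path_ranges is a made-up name) reads the output of git log --follow --raw --diff-filter=R --pretty=format:%H -- <path> on stdin, where --diff-filter=R keeps only the commits in which a rename was detected, and prints which path name applies from which commit. It assumes the log is newest-first (git log's default) and that the history is close to linear; with merges, "from" and "before" are only approximate.

```shell
# path_ranges: turn rename records from
#   git log --follow --raw --diff-filter=R --pretty=format:%H -- <path>
# into "which name applies where" lines. Input is expected newest-first.
path_ranges() {
    awk -F '\t' '
        /^[0-9a-f]+$/ { hash = $0; next }       # a commit hash line
        /^:/ {                                  # the raw rename record for that commit
            printf "from %s: %s\n", hash, $3    # new name, in use from this commit on
            oldest_hash = hash
            oldest_name = $2                    # name in use before this commit
        }
        END {
            # After the last (oldest) rename, report the original name.
            if (oldest_hash != "")
                printf "before %s: %s\n", oldest_hash, oldest_name
        }
    '
}
```

Each "from" line holds until the next newer rename commit; the final "before" line names the path to keep in everything older than the oldest detected rename. Path names containing tabs or other unusual characters are quoted by git in this output and would need extra handling.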

If that doesn't work, or you need more control, consider running git diff --name-status between various commit pairs (with commit pair info coming from git rev-list).
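One possible shape for that loop, as a sketch only: git rev-list --parents prints each commit followed by its parents, so every line gives you the commit pairs to compare. The function name list_renames_per_commit and the tab-separated output layout are my own choices here, not anything git defines.

```shell
# list_renames_per_commit [REV]
# For every commit reachable from REV (default HEAD), compare it with each of
# its parents and print any renames git detects, prefixed with the commit hash.
list_renames_per_commit() {
    git rev-list --parents "${1:-HEAD}" |
    while read -r commit parents; do
        # Root commits have no parents, so the inner loop simply does not run.
        for parent in $parents; do
            git diff -M --name-status --diff-filter=R "$parent" "$commit" |
            while IFS= read -r line; do
                printf '%s\t%s\n' "$commit" "$line"
            done
        done
    done
}
```

Unlike git log --follow, this looks at every path in every commit, so it is slower, but it does not depend on following one file name at a time. Merge commits are compared against each parent separately, which can report the same rename more than once.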


[1] As long as you've asked for rename detection, "exactly the same" is sufficiently similar, as is anything down to about "50% similar". You can tweak the required similarity with the optional value you supply to git diff's -M flag.
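For instance, a small wrapper (renames_at is a made-up name) makes it easy to try different thresholds on the same pair of commits:

```shell
# renames_at THRESHOLD OLD_COMMIT NEW_COMMIT
# Compare two commits with rename detection at the given similarity threshold,
# e.g. 50% (roughly the default) or 100% (exact renames only).
renames_at() {
    git diff -M"$1" --name-status "$2" "$3"
}
```

Something like renames_at 100% HEAD~1 HEAD reports only files that moved without any content change, while renames_at 50% HEAD~1 HEAD also pairs up files that were edited as part of the move. A pair that falls below the threshold is shown as a separate delete and add.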


Edit: this seems to work OK. I used it on git's own builtin/var.c, which has had two previous names, according to this:

$ git log --follow --raw --diff-filter=R --pretty=format:%H builtin/var.c
81b50f3ce40bfdd66e5d967bf82be001039a9a98
:100644 100644 2280518... 2280518... R100       builtin-var.c   builtin/var.c

55b6745d633b9501576eb02183da0b0fb1cee964
:100644 100644 d9892f8... 2280518... R096       var.c   builtin-var.c

The --diff-filter suppresses everything but rename outputs so that we get to see which commit seems to rename the file. Turning this into something more useful requires a bit more work, but this might get you fairly far:

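git log --follow --raw --diff-filter=R --pretty=format:%H builtin/var.c |
while read -r hash; do
    # raw record: modes, blob IDs and status, then old and new path, tab-separated
    # ($'\t' is bash syntax for a literal tab)
    IFS=$'\t' read -r mode_etc oldname newname
    # blank separator line between commits (absent after the last one)
    read -r blankline
    echo "in $hash, rename $oldname to $newname"
done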

which produced:

in 81b50f3ce40bfdd66e5d967bf82be001039a9a98, rename builtin-var.c to builtin/var.c
in 55b6745d633b9501576eb02183da0b0fb1cee964, rename var.c to builtin-var.c
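To get from there to the list of names the question asks about, one option is to collect every old and new path from the raw records. This is only a sketch: all_names is a made-up function name, and it assumes path names without tabs or other characters that git would quote in this output.

```shell
# all_names: read the output of
#   git log --follow --raw --diff-filter=R --pretty=format:%H <path>
# on stdin and print every path name that appears in a rename record, once each.
all_names() {
    # Raw records start with ':' and are tab-separated:
    #   <modes, blob IDs, status> TAB <old path> TAB <new path>
    # LC_ALL=C keeps the sort order independent of the user's locale.
    awk -F '\t' '/^:/ { print $2; print $3 }' | LC_ALL=C sort -u
}
```

Piping the git log command above into all_names and appending the result to keep-these.txt would add the historical names of that one file. A file that was never renamed produces no output, so the current names from git ls-files are still needed, and the whole thing has to be repeated per path because --follow handles one path at a time.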

Upvotes: 3
