Reputation: 9116
I have a codebase that used to be managed with SVN, but is now managed with git. When the code was migrated to git, the history was lost.
I have managed to recover the SVN history, and am now trying to git-rebase the more recent commits over the top.
I have two branches: git-commits, which contains the commits since the migration to git, and svn-commits, which contains the older history. Each branch contains over 3000 commits.
I have found that the following command builds the new history on top of the old (albeit with some manual merge conflict handling):
git rebase git-commits --root --onto svn-commits --preserve-merges
Several of the commits reference commit hashes, and I am aware that these would change when the rebase is done. So that this information is not lost forever, I would like to add the original commit hash of each commit to the newly-rebased commit's message.
This would mean that an original commit like this:
commit aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Author: Boaty McBoatface <[email protected]>
AuthorDate: Wed Jul 27 00:00:00 1938 +0000
Commit: Boaty McBoatface <[email protected]>
CommitDate: Wed Jul 27 00:00:00 1938 +0000
Reticulate splines
The splines had been derezzed, and needed to be reticulated.
Would become something like
commit bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Author: Boaty McBoatface <[email protected]>
AuthorDate: Wed Jul 27 00:00:00 1938 +0000
Commit: Meshy <[email protected]>
CommitDate: Wed Nov 16 10:23:31 2016 +0000
Reticulate splines
The splines had been derezzed, and needed to be reticulated.
Original hash: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Is this possible? Perhaps with git-filter-branch?
Upvotes: 2
Views: 905
Reputation: 489698
First, a note: be sure you really want to do this, since git replace (mentioned briefly below) can be used to stitch together the histories in a way that preserves the IDs. It has its own drawbacks too, of course; search for reports from people who have used it.
Yes, you can do this with git filter-branch.
You might, though, want to combine the "rebase new commits atop new conversion" step with the "... and then edit all the new commits to also contain their old IDs" step, because rebase works by copying commits, and filter-branch works by ... copying commits. :-)
All Git commands that do this kind of thing must copy, since the hash ID of each commit is a function of the commit's contents. If the new commit is different from the original commit in any way, it gets a new, different ID.
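To make the "ID is a function of the contents" point concrete, here is a minimal sketch that computes an object ID by hand. It assumes a repository that uses SHA-1 object names (the default) and a system that has sha1sum; the helper name git_object_id is made up for this illustration.

```shell
# Hypothetical helper: compute the ID Git assigns to an object of type $1
# whose body is read from stdin. Git hashes "<type> <size>\0<body>".
git_object_id() {
    body=$(mktemp) || return 1
    cat > "$body"
    size=$(wc -c < "$body" | tr -d ' ')
    { printf '%s %s\0' "$1" "$size"; cat "$body"; } | sha1sum | cut -d' ' -f1
    rm -f "$body"
}

# Two commit-shaped bodies that differ only in the message get different IDs.
# (4b825dc... is Git's well-known empty tree; author and committer lines are
# left out for brevity, so these are not complete commits.)
printf 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n\nReticulate splines\n' |
    git_object_id commit
printf 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n\nReticulate splines\n\nOriginal hash: ...\n' |
    git_object_id commit
```

Changing anything in the body, including the parent line or the message, changes the result, which is why both rebase and filter-branch have to produce new commits.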
The differences between git rebase and git filter-branch lie in which commits are copied and how the copying is performed.
Rebase, when done without --preserve-merges, works by selecting a list of non-merge commits, turning each such commit into a changeset (via subtraction, more or less: child minus parent = delta from parent to child), then adding this delta to the --onto point or to the commits-added-so-far.
When you use --preserve-merges, rebase still selects a list of non-merge commits. Then, where there was a merge commit, rebase re-performs the merge (which is why you must resolve merge conflicts all over again). It must re-merge, because the new base may result in a different merge, and because merges cannot be turned into a single changeset ("child - parent" gives you one delta, but there are at least two parents, hence at least two deltas, and in the general case we cannot preserve both).
Filter-branch uses an entirely different approach. The commits to be filtered are selected regardless of whether they are merges or not. (The actual selection is done by running git rev-list, which is the "plumbing" equivalent of git log.) This complete list of commit IDs is placed into a pile: a sorted, topological-order pile stored in an ordinary file, so that parent commits are always processed before their children.
Then, for each ID in the list:
1. Extract the original commit a la git checkout, into a temporary tree that has no underlying Git repo.
2. Apply the tree filter to modify the tree. (This modification runs in the temporary directory that holds the temporary tree. That part trips up a lot of people doing their first tree-filter, when they try to access a file like ../../fixed-version. The relative path fails because the temporary tree is not in the repository at all.)
3. Reconstruct a new set of Git tree-and-blob-objects representing the new tree, i.e., the new commit snapshot.
4. Apply the commit message filter to the message.
5. Apply the commit environment filter to the remaining commit metadata (author and committer stuff).
6. Make a new commit using the new message and new tree. Or, if you supply a commit filter, use it to make-or-don't-make the commit; you can also modify the new commit's parent(s) at this point, using the parent filter.
7. Last, record a pairing: "old commit <oldhash> became new commit <newhash>." (If you skip a commit using a commit filter, the old hash maps instead to its corresponding new ancestor, i.e., the parent that you didn't skip.) This pairing is a map.
This process is extremely slow due to the extract + tree-filter + rebuild part. Therefore, if you don't use a tree filter, git filter-branch skips this part: it's just going to get the original tree back anyway. To let you modify the new commit's contents anyway, filter-branch also lets you specify an index filter (commits are always made from the index, so the extract+modify+rebuild just updates the index; if we can update it in place, that's much faster). But, and here's the key point, for your purposes you don't need to do anything at all to each tree. All you want is to modify the parentage! This will let you preserve your original merges and their source trees, with no re-merging.
Note that the --commit-filter description talks about the map convenience function (shell function). This "map" function uses the map I mentioned above. The default is to automatically map to the new parent of the new copied commit.
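A rough model of that lookup, for intuition only: the sketch below keeps one small file per rewritten commit in a scratch directory, and any ID that was never rewritten maps to itself. The names map_dir, record_rewrite, and map_id are hypothetical, and the short IDs are placeholders; this is not filter-branch's own source.

```shell
# Hypothetical stand-in for filter-branch's old-ID -> new-ID table.
map_dir=$(mktemp -d)

# Remember that old commit $1 was copied to new commit $2.
record_rewrite() { printf '%s\n' "$2" > "$map_dir/$1"; }

# Print the new ID for old commit $1, or $1 itself if it was never rewritten.
map_id() {
    if [ -r "$map_dir/$1" ]; then
        cat "$map_dir/$1"
    else
        printf '%s\n' "$1"
    fi
}

record_rewrite aaaa bbbb
map_id aaaa   # rewritten commit: prints bbbb
map_id cccc   # never rewritten: prints cccc
```

Because parents are always processed before their children, a child's parent IDs can be translated through this table by the time the child is copied.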
Finally, after copying all the commits (and, if you provide a --tag-name-filter, also copying annotated tags and mapping the copies, so if you do have annotated tags, you do want a --tag-name-filter cat here), the filter-branch command rewrites some references, i.e., branch and tag names. The original references, which will still point to the original commits (and annotated tag objects), are dumped into the refs/original/ name-space. (This must be empty at the start of the process unless you use --force.) The rewritten references point into the new copies. The rewrite uses the same mapping technique, so that if there are skipped commits, the names now point to the retained ancestor commits.
("Some" references? Wait, which references? The answer is in the documentation, but it's a bit mysterious: it talks about positive references. The arguments get passed to git rev-list so that you can filter a specific range of commits, e.g., branch~30..branch or branch ^otherbranch. The "positive" references are the ones that actively select commits, while the "negative" references are the ones that limit commits, so for branch ^otherbranch we have one positive reference, branch, and one negative, the not-otherbranch part. So this rewrites only refs/heads/branch and not refs/heads/otherbranch.)
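To preview which commits such a range selects before starting a long filtering run, something like the following could be used. The function name preview_filter_range is invented for this sketch, and the rev-list options only approximate the order in which filter-branch walks the commits.

```shell
# Hypothetical helper: list, parents first, the commits reachable from the
# positive ref $1 but not from the negative ref $2.
preview_filter_range() {
    git rev-list --reverse --topo-order "$1" "^$2"
}

# Example (inside a repository that has both names):
#   preview_filter_range branch otherbranch
```

Only the positive name (branch in the example) would have its reference rewritten; the negative one merely limits the walk.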
The reason to explain all of the above is to point out how simple the transplant process is when using git filter-branch, and then to show how to access the map.
First, we only need to explicitly replace one single parent ID. Specifically, we want the parent of the root commit in git-commits to become the existing tip commit of svn-commits:
$ git rev-parse svn-commits
9999999999999...
(that's the desired new parent), and:
$ git rev-list --max-parents=0 git-commits
11111111111111...
(that's the root commit—with any luck there is only one, otherwise, now what?).
So, we would want a parent filter that says: "if this is commit 1111111... then echo 9999999..., else just echo the arguments back". The default parent arguments are on stdin, as a series of -p <id>s, with the IDs already mapped. Of course, an existing root has no parents, so stdin will have no contents for the one commit we want to change here. Hence:
--parent-filter 'if [ "$GIT_COMMIT" = 11111... ]; then
echo -p 999999...; else cat; fi'
This part of the filter-branch will accomplish our re-parenting. Note that unlike git rebase, all the trees are simply retained intact. We never convert a snapshot to a delta here, we just take it as-is. This means there is no need to re-resolve merge conflicts.
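Since the parent filter is just a shell fragment that reads parents on stdin and writes parents on stdout, it can be tried on its own before being run across 3000 commits. A sketch, wrapping the same logic in a function (parent_filter is a name made up here) and using invented full-length IDs as placeholders rather than real commits:

```shell
old_root=1111111111111111111111111111111111111111    # placeholder ID
new_parent=9999999999999999999999999999999999999999  # placeholder ID

# Same logic as the --parent-filter fragment, wrapped in a function.
parent_filter() {
    if [ "$GIT_COMMIT" = "$old_root" ]; then
        cat >/dev/null          # a root has no parents; drain stdin anyway
        echo "-p $new_parent"   # give the old root its new parent
    else
        cat                     # pass the already-mapped parents through
    fi
}

# The old root gains a parent; any other commit keeps what it was given.
printf '' | GIT_COMMIT=$old_root parent_filter
echo '-p 2222222222222222222222222222222222222222' |
    GIT_COMMIT=3333333333333333333333333333333333333333 parent_filter
```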
(Side note: you can actually use the name svn-commits in place of the hard-coded 99999... here. You could use a name in place of the hard-coded 11111... as well, but we don't have a name. Also, looking up the name each time will add a tiny bit of delay to the filtering. For the one re-parenting to svn-commits, that's one tiny delay; for testing whether this is the old root, though, that would be one tiny delay times 3000 commits.)
(Second side note: you can also do this reparenting via "grafts" or its more modern version, git replace. If a graft or replacement is in force when you run filter-branch, that graft or replacement becomes permanent, since Git simply copies the commits as instructed, with the instructions also following the replacement.)
That still leaves the problem of filtering the commit messages, to add:
Original hash: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
As shown above, the original hash is in $GIT_COMMIT, so all we need is this:
--msg-filter 'cat; echo; echo "Original hash: $GIT_COMMIT"'
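A quick way to see what this fragment produces is to feed it a message by hand with GIT_COMMIT set, roughly as filter-branch would. A sketch (the variable name msg_filter exists only for this illustration):

```shell
# The fragment exactly as it would be passed to --msg-filter.
msg_filter='cat; echo; echo "Original hash: $GIT_COMMIT"'

# Simulate one commit: old message on stdin, original ID in the environment.
printf 'Reticulate splines\n\nThe splines had been derezzed, and needed to be reticulated.\n' |
    GIT_COMMIT=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa sh -c "$msg_filter"
```

The plain echo in the middle adds a blank line, so the "Original hash:" line ends up as its own trailing paragraph of the message.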
If we wanted to be fancy, we could even use that map convenience function:
--msg-filter 'cat; echo; echo "new commit $(map $GIT_COMMIT) \
filtered to reparent original commit $GIT_COMMIT"'
or something silly like that, but there's no good reason to bother ... unless you want to get really fancy, and see if you can detect old hash IDs in the commit message and rewrite them in place. I'm not sure if this is even a good idea, and won't attempt to provide a bit of shell script for it, but note that all[1] of these filters are "eval"-ed as shell fragments. You can invoke other shell scripts from these eval-ed fragments, just remember that all the filtering is going on in a temporary directory.
Run the filtering on the reference git-commits. Once the filtering is done, refs/heads/git-commits will point to the last copied commit, and refs/original/refs/heads/git-commits will point to the original chain (the one rooted at 11111... in the above examples).
[1] Well, almost all. As the documentation says, "with the notable exception of the commit filter, for technical reasons".
We need two filters, --parent-filter (or a graft or replacement in force), and --msg-filter. The parent filter says "replace the root of the transplanted copy with the tip of the place we're transplanting onto", and this accomplishes our rebase-without-changing-snapshots. The message filter says "this new commit replaces the commit whose ID we expanded at filtering-time from the variable $GIT_COMMIT".
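Putting the two filters together might look like the sketch below. It has not been run against the repository in the question; it assumes the branch names git-commits and svn-commits from the question, a clean work tree, and an empty refs/original/ name-space. The function name transplant_history is invented here.

```shell
# Sketch: re-parent the root of git-commits onto the tip of svn-commits and
# append each commit's original ID to its message.
transplant_history() {
    new_parent=$(git rev-parse --verify svn-commits) || return 1
    old_roots=$(git rev-list --max-parents=0 git-commits) || return 1

    # The parent filter below handles exactly one root; stop otherwise.
    set -- $old_roots
    if [ "$#" -ne 1 ]; then
        echo "expected one root commit in git-commits, found $#" >&2
        return 1
    fi

    # FILTER_BRANCH_SQUELCH_WARNING only silences the notice that newer
    # Git versions print before filtering starts.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
        --parent-filter "if [ \"\$GIT_COMMIT\" = $1 ]; then echo '-p $new_parent'; else cat; fi" \
        --msg-filter 'cat; echo; echo "Original hash: $GIT_COMMIT"' \
        -- git-commits
}
```

After a run, refs/original/refs/heads/git-commits still names the pre-rewrite tip, so the old and new chains can be compared (for example with git log on both) before anything is deleted.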
Upvotes: 2
Reputation: 1
The answer may depend on the number of commits that you want to rebase.
If the branch you are rebasing contains few enough commits that you can edit them by hand, the following hint may do the job:
https://help.github.com/articles/changing-a-commit-message/
In general, interactive rebase should be enough; with luck you will not need to resort to branch filtering.
r, reword = use commit, but edit the commit message
With interactive rebase, reword each commit and insert the original hash into its commit message.
For a larger number of commits, in this case 3000 or so, let's try filter-branch:
git filter-branch --msg-filter 'cat && echo "Original hash $GIT_COMMIT"' HEAD~3000..HEAD
It will produce a new commit with a rewritten commit message for each of the past 3000 commits on the branch you currently have checked out. The new commit message will look similar to this (note the original commit hash at the bottom):
commit 08ac9b84d820ec7b70fa53075adc06f0a8185ab4
Author:
Date: Mon Nov 14 13:14:30 2016 +0100
Adds javadoc
Auto inserted text: ....
Change-Id: ...dbf9497387a3c271ae0349822cb4b8...
Original hash 9d01f3e5b39b15c9dbe923916b6c25019b5b9796
After that you can safely do your rebase. The old commit hash will be preserved in each commit message.
BR Maciej
Upvotes: 0