Reputation: 4359

Turning a forked repository back into a branch

I'm working with a codebase A which was at some point forked to codebase B. These codebases live in separate Git repositories, and we can assume that all development has happened on a single branch in each one.

Unfortunately the history of B is not complete: it starts with an "Initial import" commit whose snapshot corresponds to some point in the middle of A's development.

Subsequently, both A and B were developed separately and diverged.

I'd like to try to wrangle this mess so that B is a branch of A, with a common history up to the point that they diverged.

Upvotes: 2

Answers (3)

torek

Reputation: 489293

There's a relatively easy way to do this that might be "good enough". It requires that you pick one commit in repository A that you will declare to be the "base commit" of all commits in repository B. It encourages, but does not require, that you then replace all the commits from B with new-and-improved commits, which you can do with git filter-branch.

Background

The process is fairly simple and easy to visualize as long as you remember that each Git commit is a snapshot of all source files, plus some metadata. The metadata in each commit gives:

who made the commit (author and committer), when (time-stamps for each of those), and why (log message);
the tree that represents its snapshot; and
the parent commit(s), so that Git can follow history backwards, one commit at a time, from each commit to its predecessor(s).

This information forms all the commits in a repository into a graph G = (V, E), where V and E are sets of nodes and edges. Each vertex in V is identified by one commit hash ID (each hash ID is unique to that one commit), and each edge in E is a one-way arrow or arc, with the edge set built from all the stored parent hash IDs in each commit. Intuitively, this means we can draw a simple linear graph like this:

[start] o <-o <-o ... <-o  [end]

A branch-y graph simply forks somewhere:

        o--o   <-- end1
       /
o--o--o
       \
        o--o--o   <-- end2

and a graph that forks, then re-merges, might look like:

        o--o
       /    \
o--o--o      o--o   <-- master
       \    /
        o--o

The end nodes are found via branch names, such as end1 and end2, or master. Since Git's internal arrows are all one-way, pointing backwards, we need these starting-points (ending-points?) to be able to find the rest of the commits.

There's no requirement that the graph be connected:

A--B--C   <-- br1

F--G--H   <-- br2

could represent a repository with six commits and two branches. Commit C is the last commit on branch br1 and commit H is the last one on br2; working backwards from the two tips, we can enumerate all six commits.
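
A disconnected graph like this is easy to produce on purpose. Here is a minimal sketch, assuming a throwaway directory, a placeholder identity, and made-up file names: it builds br1 and br2 with unrelated roots (git checkout --orphan starts a branch whose first commit has no parent) and then counts what is reachable from the two tips.

```shell
set -e
# Placeholder identity so the commits work on a machine without Git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/br1    # the first commit will create br1

# Commits A, B, C on br1
for msg in A B C; do
  echo "$msg" > file1
  git add file1
  git commit -q -m "$msg"
done

# Start br2 from nothing: its first commit will have no parent
git checkout -q --orphan br2
git rm -q -rf .
for msg in F G H; do
  echo "$msg" > file2
  git add file2
  git commit -q -m "$msg"
done

total=$(git rev-list --count br1 br2)                    # reachable from either tip
roots=$(git rev-list --max-parents=0 --count br1 br2)    # commits with no parent
echo "commits: $total, roots: $roots"
```

Two roots in one repository is exactly the "two disconnected graphs" situation described above.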

Your case

In your case, you have a repository A with some set of branches (perhaps just one), which identify some set of tip commits, and a separate repository B with some set of branches, again maybe just one, identifying some set of tip commits. Using git remote add <name> <url> followed by git fetch <name> within A (or any identical clone of A), you can have the Git for that repository call up the Git for repository B, obtain all of its commits and branches, and put them into repository A.

Let's draw an example, assuming for simplicity that repository A has exactly three commits ending in master:

A--B--C   <-- master

and that repository B also has three commits, ending in its master:

D--E--F   <-- master

We'll use remote name remote-B to have Git A call up Git B and get its commits. Git A will rename B's master to remote-B/master in the process, so now we have:

A--B--C   <-- master

D--E--F   <-- remote-B/master

in A. So now we have one repository with two disconnected graphs. You can, if you wish, now attach a regular branch name—rather than a remote-tracking name—to commit F:

git branch develop remote-B/master

so that F is named by both develop and remote-B/master, and you can now, if you wish, git remote remove remote-B to remove the name remote-B/master. Commits D-E-F remain in your repository, find-able via your name develop.
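
Put together, the fetch-and-rename sequence might look like the sketch below. The two repositories are stand-ins created in a temporary directory (make_repo is a hypothetical helper, and the file names and identity are placeholders); only the last five commands are the ones you would run against your real A and B.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Hypothetical helper: make_repo <dir> <file> <message>...
make_repo() {
  dir=$1 file=$2
  shift 2
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/master
  for msg in "$@"; do
    echo "$msg" > "$dir/$file"
    git -C "$dir" add "$file"
    git -C "$dir" commit -q -m "$msg"
  done
}
make_repo "$work/A" a.txt A B C    # stand-in for repository A
make_repo "$work/B" b.txt D E F    # stand-in for repository B

cd "$work/A"
git remote add remote-B "$work/B"     # a path here; a URL works the same way
git fetch -q remote-B                 # creates remote-B/master
git branch develop remote-B/master    # ordinary branch name for commit F
git remote remove remote-B            # drops remote-B/master; the commits stay
git log --oneline --graph master develop
```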

Connecting the graphs

Now let's say that we decide that commit D, which we got from B but is now ours as well, closely resembles commit B, and we'd like to pretend that D has B as a parent:

A--B--C   <-- master
    \
     \
      D--E--F   <-- develop

We can't actually change commit D. Its unique hash ID is uniquely D-ish. We could extract commit D and make a copy of it that's slightly different (a D', if you will), and make D' re-use everything from the original commit D but have B as its parent:

A--B--C   <-- master
    \
     D'

      D--E--F   <-- develop

We'd then need to copy E and F to E' and F', where E' is just like E except that it has D' as its parent, instead of D as its parent; and F' is like F but has E' as its parent:

A--B--C   <-- master
    \
     D'-E'-F'

      D--E--F   <-- develop

Then all we have to do is peel the name develop off commit F and paste it onto F' and we have what we want:

A--B--C   <-- master
    \
     D'-E'-F'   <-- develop

      D--E--F   [abandoned]

But how do we arrange to copy D to D', E to E', and F to F'? The answer is to use git filter-branch, which can copy commits and make changes as it does its copying. There's a hard way—not really that hard, but it's harder than the easier way—and an easier way, and we should actually start with the hard way.

The hard way is to use --commit-filter. Here, we have a say in how each new commit is to be made. The default action is to use git commit-tree "$@"; if we did that, we'd make all the copies 100% identical to the originals, so that instead of making a D' we'd just actually re-use D, then re-use E and F and we'd have no change. So we have to use something more complicated:

'if [ "$GIT_COMMIT" = ___ ]; then git commit-tree -p ___ "$@"; else git commit-tree "$@"; fi'

except that we have to fill in both blanks. The first one, we'd fill in with the actual hash ID of commit D. The second, we'd fill in with the actual hash ID of commit B. This says: when we're copying original commit D, add B as a parent. Since D itself has no parents, this makes Git create D' with one parent, namely B. After that, when filter-branch copies E, it will use D' as its parent, creating E'; and then it will use E' as the parent for the copy of F. Then, having copied all the commits that we need to copy (which is really just D-through-F), filter-branch will yank the old branch name(s) off the old commits and make them point to the new copies.
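
For concreteness, here is a sketch of the full invocation on a toy repository. Everything in the setup is an assumption for illustration (one file f, a placeholder identity, and an orphan branch standing in for the commits fetched from B); the hash IDs are looked up rather than typed. FILTER_BRANCH_SQUELCH_WARNING only silences the notice that newer Git versions print before filter-branch starts.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
for m in A B C; do echo "$m" > f; git add f; git commit -q -m "$m"; done
git checkout -q --orphan develop       # unrelated history, as if fetched from B
for m in D E F; do echo "$m" > f; git add f; git commit -q -m "$m"; done

hash_B=$(git rev-parse master~1)                  # the chosen base commit
hash_D=$(git rev-list --max-parents=0 develop)    # root of the imported history

# Splice the two hash IDs into the filter text: when copying D, add B as parent
git filter-branch --commit-filter '
  if [ "$GIT_COMMIT" = '"$hash_D"' ]; then
    git commit-tree -p '"$hash_B"' "$@"
  else
    git commit-tree "$@"
  fi' -- develop

git log --oneline --graph master develop
```

After this, develop points at F', and the original D-E-F chain is reachable only through the refs/original/ backup that filter-branch leaves behind.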

Getting to this state more easily

The problem with the above is that we have to find the hash IDs of B and D, and then type them in very carefully. One slip-up, one single-character typo, and this whole filter-branch operation—which can be pretty slow, depending on how many commits you copy—is ruined and we have to start over. (If you're really good at Git, this starting-over is not that bad, but this is a pretty rare exercise and few people are good at it.)

So, instead of doing it this way, we can use git replace. What git replace does is—well, it has several modes of operation, but the one we'll use here is: it copies one commit, and lets us make changes before making the copy. We'll copy commit D to an altered D'. Having made the copy, it then arranges for most Git commands to switch to the copy automatically.

So, what we'll do here is find the hash ID for D—that's pretty easy:

git rev-list --topo-order develop | tail -n 1

for instance. The rev-list walks the list of commits. The --topo-order ensures that even if there's weird internal branching and such, the first commit comes out last. This only fails if there are multiple first commits, i.e., we have a situation like:

D
 \
  F   <-- develop
 /
E

in which case we have to replace both D and E. Or we can use:

git rev-list --max-parents=0 develop

which lists all root commits reachable from develop, which finds us both D and E directly if we have something like the above.
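
To see the difference between the two commands, here is a small sketch that builds the two-root shape drawn above. The branch name side, the file names, and the identity are all made up for the demo, and --allow-unrelated-histories needs a reasonably recent Git.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/develop
echo d > d.txt; git add d.txt; git commit -q -m D     # first root

git checkout -q --orphan side                          # second, unrelated root
git rm -q -rf .
echo e > e.txt; git add e.txt; git commit -q -m E

git checkout -q develop
git merge -q --allow-unrelated-histories -m F side     # F has parents D and E

last=$(git rev-list --topo-order develop | tail -n 1)  # only one of the roots
roots=$(git rev-list --max-parents=0 develop)          # both roots
echo "tail -n 1 found: $last"
echo "all roots:"
echo "$roots"
```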

Anyway, having found D—and assuming there's just the one commit—we now want to replace it with its D' copy. Now we need to pick some commit such as B, using git log on the original set of commits and picking a usable one. Which one we pick isn't all that important, but you can run:

git diff <hash-of-B> <hash-of-D>

to see how close the two snapshots are. A commit with a "really close" or "exact match" snapshot is a good candidate for the new parent.
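
If you suspect that some commit in A has exactly the same snapshot as D, you can let Git look for it instead of diffing by hand, since identical snapshots have identical tree hash IDs. The following is only a sketch on a toy repository (the file f, its contents, and the identity are assumptions); the loop at the end is the part that carries over.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
for m in one two three; do echo "$m" > f; git add f; git commit -q -m "$m"; done

git checkout -q --orphan develop
echo two > f; git add f; git commit -q -m "Initial import"   # same snapshot as commit "two"
echo more >> f; git commit -q -am "fork work"

root=$(git rev-list --max-parents=0 develop)
root_tree=$(git rev-parse "$root^{tree}")

# Walk master and report every commit whose tree matches the root's tree
match=""
for c in $(git rev-list master); do
  if [ "$(git rev-parse "$c^{tree}")" = "$root_tree" ]; then
    match=$c
    echo "exact snapshot match: $(git log -1 --format='%h %s' "$c")"
  fi
done
```

When nothing matches exactly, you are back to comparing git diff output and picking the closest candidate.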

Now we run:

git replace --graft <hash-of-D> <hash-of-B>

This makes our D', along with a special name, refs/replace/<hash-of-D>, that Git uses so that every time it comes across D, Git quickly looks away to D' instead. Since D' has parent B, most of Git now believes that from commit E, the next-back commit is D', and then there's another commit one step back from that: B.

That is, we now have:

A--B--C   <-- master
    \
     D'   <-- refs/replace/<big-ugly-hash-of-D>

      D--E--F   <-- develop

You can stop at this state, but note that if you do, any clone of this repository won't clone the replacement D' commit by default. So the clone won't look aside from D to D' and won't think that the history goes F to E to D' to B to A. The clone will see the true history, F to E to D and stop.
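
You can preview both views in one repository: git --no-replace-objects tells Git to ignore refs/replace/ for that one command, which is what a fresh clone without the replacement would see. A sketch on a toy repository (the setup, file name, and identity are assumptions):

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
for m in A B C; do echo "$m" > f; git add f; git commit -q -m "$m"; done
git checkout -q --orphan develop
for m in D E F; do echo "$m" > f; git add f; git commit -q -m "$m"; done

hash_B=$(git rev-parse master~1)
hash_D=$(git rev-list --max-parents=0 develop)

git replace --graft "$hash_D" "$hash_B"

with=$(git rev-list --count develop)                          # F, E, D', B, A
without=$(git --no-replace-objects rev-list --count develop)  # F, E, D
echo "with replacement: $with commits; without: $without commits"
git replace -l                                                # lists the hash of D
```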

You can make clones pick up the replacement commit, after which they'll pretend that the histories are joined. But it's simpler now to actually join the histories, using git filter-branch. By default, filter-branch obeys replacements, so it will copy commit A (with no changes, so that the result is A), then B, then (in some order) C and D'. After it has copied D' (again with no changes, so that the result is D'), filter-branch will copy E to E' using D' as its parent, then F to F' using E' as its parent. So now you have the same result you would have if you had run git filter-branch with the right --commit-filter and no typos.
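
A sketch of the cementing step, again on a toy repository with an assumed setup (file f, placeholder identity, orphan branch standing in for B's commits). Note that no filter is given at all: the graft does the work, and filter-branch just re-creates the commits that follow it. The final check uses --no-replace-objects to confirm the join no longer depends on the replacement.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # silence the notice newer Git prints

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
for m in A B C; do echo "$m" > f; git add f; git commit -q -m "$m"; done
git checkout -q --orphan develop
for m in D E F; do echo "$m" > f; git add f; git commit -q -m "$m"; done

hash_B=$(git rev-parse master~1)
hash_D=$(git rev-list --max-parents=0 develop)
git replace --graft "$hash_D" "$hash_B"

git filter-branch -- develop     # rewrites E and F on top of D'
git replace -d "$hash_D"         # the replacement is no longer needed

real=$(git --no-replace-objects rev-list --count develop)
echo "develop now really has $real commits: F', E', D', B, A"
```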

The other nice thing about using git replace here is that you can:

delete the replacement, if you don't like it, or
replace the replacement with a better replacement (equivalent to delete and then re-replace), using the -f / --force flag.
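
Both operations in a short sketch, assuming the same kind of toy repository as before (made-up file name and identity). The commit counts show where develop appears to be attached after each step.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
for m in A B C; do echo "$m" > f; git add f; git commit -q -m "$m"; done
git checkout -q --orphan develop
for m in D E F; do echo "$m" > f; git add f; git commit -q -m "$m"; done

hash_A=$(git rev-parse master~2)
hash_B=$(git rev-parse master~1)
hash_D=$(git rev-list --max-parents=0 develop)

git replace --graft "$hash_D" "$hash_B"       # first attempt: attach D to B
on_B=$(git rev-list --count develop)

git replace -f --graft "$hash_D" "$hash_A"    # changed our mind: attach D to A
on_A=$(git rev-list --count develop)

git replace -d "$hash_D"                      # undo: back to the true history
plain=$(git rev-list --count develop)

echo "grafted on B: $on_B, grafted on A: $on_A, no graft: $plain"
```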

So you can experiment with different joinings-up of the histories, and decide which one you like the most, before cementing it into place with git filter-branch. Before the cementing, you can still obtain new commits from repository B. After the cementing, you've, well, committed, if you'll pardon the phrasing, to the new replacement commit hash IDs and you can no longer easily incorporate new commits from B.

Upvotes: 3

Lienhart Woitok

Reputation: 426

There is a way to merge these two repositories into one, but it is a bit complicated, and I would strongly recommend saving a backup of your complete repository somewhere safe beforehand, e.g. with git clone --mirror https://git.example.org/repo.git, and keeping this backup for a while in case you discover problems at some point in the future.

The method I describe uses git replace to first tell git to replace a certain commit with a different commit, namely to replace the initial import commit of B with the corresponding commit of A. Afterwards, the whole history is rewritten to make these replacements permanent, as the replace mechanism is not too stable across all git commands. It is better not to have replacements in your repo forever.

To better illustrate the procedure and let you try it out first, I'll set up a test repo.

mkdir orig-repo
cd orig-repo/
git init
touch foo
git add foo
git commit -m 1
echo bar > foo
git commit -m 2 foo
echo foo > bar
git add bar
git commit -m 3
cd ..
mkdir fork-repo
cd fork-repo/
git init
cp ../orig-repo/foo ../orig-repo/bar .
git add foo bar
git commit -m a
echo baz >> foo
git commit -m b foo
cd ../orig-repo/
echo bla >> bar
git commit -m 4 bar

This just creates two distinct repos with a couple of commits and a shared set of files.

Working off this base, let's merge these two repos together:

user@host:/tmp/git-replace-test/orig-repo (master)$ git remote add fork ../fork-repo/
user@host:/tmp/git-replace-test/orig-repo (master)$ git fetch fork
warning: no common commits
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 7 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (7/7), done.
From ../fork-repo
 * [new branch]      master     -> fork/master

Basic. Just fetch the fork repo so its commits are available in our original repo.

user@host:/tmp/git-replace-test/orig-repo (master)$ git replace --graft $(git rev-parse fork/master) $(git rev-parse master~)

This is the important one: it tells git that the first commit of our fork repo is to be replaced by a certain commit of our original repo. The --graft option effectively changes the parent of a commit to something else; here we replace the initial commit with the whole history of that commit from the original repo.

Please do not use the command as-is on your real repo; the revisions I use here work only for my example. The first revision is the first commit of the fork after the initial import commit (b in the example). As our example fork has only two commits, that is the head commit; with a third commit it would no longer be the head. The second revision is the commit from which the repo was originally forked. In our case this is commit 3, one commit behind master. Please insert real commit hashes instead of the rev-parse expressions to be sure of exactly what you are doing.

After this you can check with git log that your original repo is still unchanged. But the log of the fork is more interesting now:

user@host:/tmp/git-replace-test/orig-repo (master)$ git log fork/master
commit 61ca43d062128c9fcddb9352698363e1bcf12a86 (replaced, fork/master)
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    b

commit 2ff1b501ecadf5af0fcb1462e6aece1f70aa2ab6
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    3

commit 646f4082ee2cb77cd11179fe33be1890f04a4c7d
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    2

commit c6501b32d69cdc9d79bcd6dc6b8220456c4ceb02
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    1

Notice the word "replaced" at the head commit? Our replace command replaced this commit with a commit with identical content but a different parent. Looks good, right? You can check that the hashes of commits 1 to 3 are the same as on our original master branch. Looks good for merging, so let's do that now.

user@host:/tmp/git-replace-test/orig-repo (master)$ git merge -m 'Merge in fork' fork/master
Merge made by the 'recursive' strategy.
 foo | 1 +
 1 file changed, 1 insertion(+)
user@host:/tmp/git-replace-test/orig-repo (master)$ git log
commit 9d8a33dd4ec3a8bdf746e717ccc3d9df74af66f5 (HEAD -> master)
Merge: a406f35 61ca43d
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:37 2019 +0200

    Merge in fork

commit a406f35bae904389b739c2a06cebd15e87146f21
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    4

commit 61ca43d062128c9fcddb9352698363e1bcf12a86 (replaced, fork/master)
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    b

commit 2ff1b501ecadf5af0fcb1462e6aece1f70aa2ab6
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    3
...

Nearly perfect. But there is still this replaced commit lingering in there. We should get rid of that now as it can cause problems.

user@host:/tmp/git-replace-test/orig-repo (master)$ git filter-branch -- --all
Rewrite c6501b32d69cdc9d79bcd6dc6b8220456c4ceb02 (1/7) (0 seconds passed, remaining 0 predicted)
Rewrite 646f4082ee2cb77cd11179fe33be1890f04a4c7d (2/7) (0 seconds passed, remaining 0 predicted)
Rewrite 2ff1b501ecadf5af0fcb1462e6aece1f70aa2ab6 (3/7) (0 seconds passed, remaining 0 predicted)
Rewrite 7cfc7a8647fd74696852e05635a6eb3c823d3766 (4/7) (0 seconds passed, remaining 0 predicted)
Rewrite a406f35bae904389b739c2a06cebd15e87146f21 (5/7) (0 seconds passed, remaining 0 predicted)
Rewrite 61ca43d062128c9fcddb9352698363e1bcf12a86 (6/7) (0 seconds passed, remaining 0 predicted)
Rewrite 9d8a33dd4ec3a8bdf746e717ccc3d9df74af66f5 (7/7) (0 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten
Ref 'refs/remotes/fork/master' was rewritten
WARNING: Ref 'refs/replace/61ca43d062128c9fcddb9352698363e1bcf12a86' is unchanged
user@host:/tmp/git-replace-test/orig-repo (master)$ git log
commit d84bcb9117f93655b72843cb051c923d1ea2ddb1 (HEAD -> master)
Merge: a406f35 7cfc7a8
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:37 2019 +0200

    Merge in fork

commit a406f35bae904389b739c2a06cebd15e87146f21
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    4

commit 7cfc7a8647fd74696852e05635a6eb3c823d3766 (fork/master)
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    b

commit 2ff1b501ecadf5af0fcb1462e6aece1f70aa2ab6
Author: User Name <[email protected]>
Date:   Wed Sep 4 23:20:28 2019 +0200

    3
...

Commit b is now a real commit in that branch, but it also got a new hash. The commits from the original repo kept their hashes, so no need to force push or anything.

I hope this gives you the result you were hoping for.

Please bear in mind that this is a simple example. Your mileage with a big repo may vary. I also made the assumption that only one commit in the fork has the initial commit as its parent; otherwise you probably have to use replace on each of them.
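
If several commits do have the import commit as their parent, a loop over its children is one way to graft them all. This is an untested-on-real-data sketch with assumed names (fork-a and fork-b stand in for two fork branches, and the fork is simulated with an orphan branch inside one repo). It also assumes every child has the import commit as its only parent, because --graft replaces the whole parent list.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
for m in 1 2 3; do echo "$m" > foo; git add foo; git commit -q -m "$m"; done
forkpoint=$(git rev-parse master)        # commit the fork was taken from

git checkout -q --orphan fork-a
git commit -q -m "Initial import"        # same snapshot as commit 3, no history
import=$(git rev-parse HEAD)
echo b >> foo; git commit -q -am b       # first child of the import commit
git checkout -q -b fork-b "$import"
echo c >> foo; git commit -q -am c       # second child of the import commit

# rev-list --children prints each commit followed by its children
children=$(git rev-list --children fork-a fork-b |
  awk -v root="$import" '$1 == root { for (i = 2; i <= NF; i++) print $i }')

for child in $children; do
  git replace --graft "$child" "$forkpoint"
done
git log --oneline --graph master fork-a fork-b
```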

Good luck.

Upvotes: 2

Code-Apprentice

Reputation: 83557

This is a little tricky since there isn't a shared history between the two repos. With that said, here are two commands that I would use to get started if I were in the same situation:

git remote - You can create multiple remotes in a single repository with git remote. By default, git clone creates a remote named origin. You can add other remotes with git remote add <name> <uri>, where <uri> can be a URL or a file path, and then fetch their commits with git fetch <name>.
git rebase - Use this to copy commits from one history to another. I'm actually not sure how this works when dealing with two unrelated histories. I suggest looking at git help rebase for more information.
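
For what it's worth, the three-argument form git rebase --onto <newbase> <upstream> <branch> can replay commits across unrelated histories, since it only needs to know which commits to copy and where to put them. The sketch below is a toy case under favourable assumptions: the fork's root commit D has the same snapshot as commit "two" in the original, so the replayed commits apply without conflicts. A real repository may well produce conflicts, and all names here are made up.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
for m in one two three; do echo "$m" > f; git add f; git commit -q -m "$m"; done

git checkout -q --orphan develop          # unrelated history standing in for the fork
echo two > f; git add f; git commit -q -m D     # same snapshot as commit "two"
echo e > g; git add g; git commit -q -m E
echo f > h; git add h; git commit -q -m F

base=$(git rev-parse master~1)                   # commit "two"
root=$(git rev-list --max-parents=0 develop)     # commit D

# Copy everything after D (that is, E and F) on top of the chosen base
git rebase -q --onto "$base" "$root" develop
git log --oneline --graph master develop
```

Unlike the graft approach, this drops D itself and re-applies E and F as patches, so it changes every commit's parent and possibly its content if conflicts have to be resolved.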

Upvotes: 1