Syncing two separate Git repos

Question

We have to work on another team's repo, but for various reasons we want to keep our development work out of view until it is ready to push to their repo.

I cloned their repo, deleted the .git folder, then created a new repo with the files. This got us up and running, but I am now not sure of the best way to merge changes from their repo into ours.

Is there a better setup where we can work in isolation but not have such an awkward time keeping on top of merges?

Mark Adelsberger · Accepted Answer

By deleting the .git folder, you removed the history. Then by creating a new repo from those files, you started a new history which, in git's view, is unrelated. Even if you committed before making any changes, you still created a new commit (one whose content merely happens to be identical).

What you want is for your team's repos to preserve the upstream repo's history so that your repos know how your changes relate to that history, and then it's (relatively) easy to merge in upstream changes. Then you control upstream visibility simply by controlling what you push to the upstream repo.

This can still be achieved, but the exact steps depend on a number of questions about your current state. I'm going to have to make some assumptions, so comment if these assumptions are wrong and I can try to adjust the answer accordingly.

So presumably you created your new repo in such a way that you have a remote named origin from which your developers clone their local repos. (Going forward I'll call that origin, and the original repository from which you took the first snapshot I'll call upstream.)

I'll also assume active development has been going on in your origin, so that it would be a problem to just re-clone from upstream; and that you are at the point where you want to integrate changes from upstream into origin.

First step, working from a clone of origin, is to pull in the upstream history.

git remote add upstream url/of/upstream/repo
git fetch upstream
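
Once the fetch completes, a quick way to confirm that git really does treat the two histories as unrelated is to ask for a merge base: `git merge-base` exits non-zero when two commits share no ancestor. A minimal sketch, where `histories_related` is a hypothetical helper name and the refs are whatever your branches are actually called:

```shell
# Hypothetical helper: report whether two refs share a common ancestor.
# Note: any failure (including a mistyped ref) is reported as "unrelated".
histories_related() {
    if git merge-base "$1" "$2" >/dev/null 2>&1; then
        echo "related"
    else
        echo "unrelated"
    fi
}
```

In the situation described here, `histories_related master upstream/master` would be expected to print `unrelated` until one of the fixes below has been applied.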

Next you need to integrate the histories. There are a couple ways to do that. The best long-term results would come from doing a history rewrite, though it requires some coordination among your team.

Ideally you would tell everyone on the team to push their work to origin; it needn't be fully merged, but all branches need to be pushed. After pushing their work, they should discard their clones and wait for you to tell them the rewrite is complete, at which point they would create new clones of origin (and, optionally, add the upstream remote as well).
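
Before anyone discards a clone, each developer may want to confirm that nothing is left unpushed. One common idiom lists commits reachable from local branches but from no remote-tracking branch of origin. In this sketch `unpushed_commits` is a hypothetical helper name, and it assumes the remote is called origin; stashes and uncommitted changes are not covered.

```shell
# Hypothetical helper: list commits that exist on local branches
# but on none of origin's remote-tracking branches.
unpushed_commits() {
    git log --oneline --branches --not --remotes=origin
}
```

If this prints anything, `git push origin --all` pushes every local branch; an empty result suggests the clone holds no committed work that origin lacks.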

If that level of coordination isn't possible, a rewrite can still be done; but it will likely be more work for some or all of the devs, as they'll have to perform their own migration of any changes that weren't in origin at the time you cloned (or did your last fetch before the rewrite). In that case, it might be better to use a non-rewriting option (more on that in a bit).

Having fetched all of your team's history (or replicated it when you cloned your working repo), you next need to identify the commit in the upstream history that corresponds to the initial commit in your origin repository.

If this was a tagged version, that may make things simple; but I'm assuming it was just whatever was on master at the time. Even in that case, if you know roughly when you took the snapshot, you should be able to track down the correct commit and verify with git diff that it matches your initial commit.

So you find that commit and we mark it O in the following graph; and for convenience you might want to tag it.
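
If the snapshot was committed without touching any files, its tree should be identical to the tree of one upstream commit, so the search can be automated by comparing tree IDs rather than reading `git diff` output by hand. A sketch, assuming that premise holds; `find_snapshot_commit` is a hypothetical helper name:

```shell
# Hypothetical helper: find_snapshot_commit <upstream-ref> <our-root-commit>
# Prints every commit reachable from <upstream-ref> whose tree is
# identical to the tree of <our-root-commit>.
find_snapshot_commit() {
    upstream_ref=$1
    our_root=$2
    want_tree=$(git rev-parse "${our_root}^{tree}") || return 1
    git rev-list "$upstream_ref" | while read -r commit; do
        if [ "$(git rev-parse "${commit}^{tree}")" = "$want_tree" ]; then
            echo "$commit"
        fi
    done
}
```

Something like `find_snapshot_commit upstream/master "$(git rev-list --max-parents=0 master)"` would print the candidates for O. More than one line means several upstream commits share that tree (after a revert, for example), and the date has to break the tie. No output means the snapshot was not an exact copy, and a manual `git diff` is still needed.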

X -- X -- O -- X -- X -- X <--(upstream/master)

o -- A -- B <--(master)(origin/master)
 \
  C <--(someBranch)

Now, the easiest thing to do is re-parent o, so that its replacement's parent is O. The other option, which produces a "cleaner" result, would be to re-parent A and C, replacing o with O as their parent. The latter can be slightly trickier, but not as much as you might think; so let's look at how to do that. In the above example, you could use something like

git filter-branch --parent-filter "sed \"s/$(git rev-parse master~2)/$(git rev-parse upstream/master~3)/g\"" -- --branches

which should give you

X -- X -- O -- X -- X -- X <--(upstream/master)
          |\
          | A* -- B* <--(master)
           \
            C* <--(someBranch)

o -- A -- B <--(refs/original/refs/heads/master)(origin/master)
 \
  C <--(refs/original/refs/heads/someBranch)
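
Before publishing the result, it may be worth checking that the rewrite changed only ancestry and not content. The sketch below assumes the `refs/original/` backup refs that filter-branch leaves behind are still present; `check_rewritten_branch` is a hypothetical helper name.

```shell
# Hypothetical helper: check_rewritten_branch <branch> <upstream-ref>
check_rewritten_branch() {
    branch=$1
    upstream_ref=$2
    backup="refs/original/refs/heads/$branch"

    # The tip's tree must be unchanged: only parent links were rewritten.
    old_tree=$(git rev-parse "${backup}^{tree}") || return 1
    new_tree=$(git rev-parse "refs/heads/${branch}^{tree}") || return 1
    if [ "$old_tree" != "$new_tree" ]; then
        echo "content differs"
        return 1
    fi

    # The rewritten branch must now share an ancestor with upstream.
    if ! git merge-base "refs/heads/$branch" "$upstream_ref" >/dev/null; then
        echo "still unrelated"
        return 1
    fi
    echo "ok"
}
```

Running it once per rewritten branch, for example `check_rewritten_branch master upstream/master`, gives a cheap sanity check on each one.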

Then you can either force-push (git push -f) all of the rewritten branches to origin, or recreate the origin from the new repo.

Note that the local repo in which you did the rewrite will have new refs under refs/original/refs/heads, representing the pre-rewrite locations of the branches. Also note that the remote-tracking refs for origin are not updated until you do the force pushes (or until you remove the origin remote and re-add it pointing at a repository that reflects the rewrite).
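
For the simpler option mentioned earlier (re-parenting o itself rather than A and C), one possible route is `git replace --graft`, which creates a replacement for o that differs only in its parent list, followed by a filter-branch run with no filters, which the filter-branch documentation describes as making replacement refs permanent. This is a sketch under those assumptions, not something from the steps above. `graft_root_onto` is a hypothetical helper name, and the same team coordination applies because every commit on your branches gets a new ID.

```shell
# Hypothetical helper: graft_root_onto <our-root-commit o> <upstream-commit O>
graft_root_onto() {
    # Resolve to full IDs first: expressions like master~2 point
    # somewhere else once the branches have been rewritten.
    o=$(git rev-parse --verify "$1^{commit}") || return 1
    O=$(git rev-parse --verify "$2^{commit}") || return 1
    # Create a replacement for o whose only parent is O.
    git replace --graft "$o" "$O" || return 1
    # filter-branch honors replace refs, so rewriting with no filters
    # bakes the new parent into real commits on every local branch.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -- --branches || return 1
    # The replace ref has served its purpose.
    git replace -d "$o"
}
```

With the first example graph this would be invoked as `graft_root_onto master~2 upstream/master~3`, leaving a rewritten copy of o sitting between O and the rewritten A and C.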

So... what if you decide a rewrite can't be done? Well, in that case you probably want to have a single "integration repo", cloned from origin with upstream added. In the integration repo you would set up a git replace mapping, telling git "whenever you encounter object o, use object O instead". This "papers over" the problem. It can have a few quirks (see the git replace docs), but ideally you'd be able to stop relying on the replace mapping (and the integration repo) after some time has passed. The developers' history would not end up quite as clean as with a rewrite, but there'd be no need for a cut-over to a rebuilt repository.

The idea here is that eventually the histories will be "combined enough" that git will understand what to do without the replacement. This would have to do with how merge bases are calculated. Consider a simple case.

X -- X -- X -- O -- A -- B -- C <--(upstream/master)

o -- D -- E <--(master)

Now you want to merge the upstream changes into master. In your integration repo you've said

git replace master~2 upstream/master~3

which we might draw as

X -- X -- X -- O -- A -- B -- C <--(upstream/master)
               :
               o -- D -- E <--(master)

so git commands by default "see"

X -- X -- X -- O -- A -- B -- C <--(upstream/master)
                 \
                  D -- E <--(master)

Meaning if you say

git checkout master
git merge upstream/master

the calculated merge base will be O and git will "think" it's giving you

X -- X -- X -- O -- A -- B -- C <--(upstream/master)
                 \             \
                  D ---- E ---- M <--(master)

which is really

X -- X -- X -- O -- A -- B -- C <--(upstream/master)
               :               \
               o --- D --- E --- M <--(master)

In this example, master is the only branch receiving changes from upstream, and at this point master's history tracks back into upstream/master; so next time you merge upstream/master into master the merge base should be C, and the replacement is no longer needed (so this could be performed in any clone, rather than needing to take place in a specially-set-up integration repo).
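
Put together, the integration-repo sequence described above might look like the following sketch. `integrate_upstream` is a hypothetical helper name. It assumes the branch receiving the merge is checked out, that it is the only branch taking upstream changes, and that the merge completes without conflicts (otherwise the helper stops and leaves the replacement in place while you resolve them).

```shell
# Hypothetical helper: integrate_upstream <o> <O> <upstream-ref>
integrate_upstream() {
    # Resolve to full IDs first: master~2 means something different
    # once the merge commit has been added on top of master.
    o=$(git rev-parse --verify "$1^{commit}") || return 1
    O=$(git rev-parse --verify "$2^{commit}") || return 1
    upstream_ref=$3
    # "Whenever you encounter object o, use object O instead."
    git replace "$o" "$O" || return 1
    # With the replacement active the histories look related, so a
    # plain merge works; on conflicts, stop and keep the replacement.
    git merge --quiet --no-edit "$upstream_ref" || return 1
    # The merge commit now ties the branch to upstream for real.
    git replace -d "$o" >/dev/null || return 1
    # Print the merge base as computed without any replacement.
    git merge-base HEAD "$upstream_ref"
}
```

For the graph above the call would be `integrate_upstream master~2 upstream/master~3 upstream/master`, and the ID printed at the end should be that of C, the upstream tip that was just merged.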

Now I mentioned that the developers' history would not be as "clean", and the obvious thing is that in the final state after you stop using the replacement, we have

X -- X -- X -- O -- A -- B -- C <--(upstream/master)
                               \
               o --- D --- E --- M <--(master)

so the "lineage" of D, E, and M is somewhat broken. In particular, it's not obvious how M should be the result of merging C into E. This could be seen as an "evil merge", though it's not as bad as some in the sense that the default merge (without using a replacement) would generate merge conflicts anyway.
