Reputation: 597
Is there a good approach to merging forked repositories that have the same file/folder structure at HEAD but different histories? A workflow that is not fully automated is acceptable, since it won't be done very often, but I hope there is a better way than copying all the files and checking the differences manually. :)
The background is that we had to migrate from a 10-year-old TFS repository to Git. There was a requirement to keep the whole history, but only for the master branch. After migrating from TFS we cleaned up the Git repository a bit, but it's still too big for Git.
We migrated it and this branch is still being used for current production deployments. There are also fixes being done in that production branch - so it cannot be ditched for some time, and these fixes are also important to keep.
In parallel we are working on major refactorings in a separate branch, where a lot of the current production codebase is still used, but a lot of historical stuff was removed or moved to different repositories.
What I wanted to do is make a fork and rewrite its history (e.g. using BFG Repo-Cleaner) to clean up all removed projects/objects.
This cleanup part worked well, but we also need to be able to merge changes made on the current production branch (one-way only, from production to the cleaned-up repo). I tried adding the old repository as an upstream, but merging the upstream repository into the repository with rewritten history makes the whole cleanup useless: it re-adds all the removed objects.
Is there any way to solve this? Maybe such a cleanup can be done in some completely different way? There are a lot of similar questions, but I didn't find exactly what I need. :)
Upvotes: 1
Views: 887
Reputation: 45659
Update - After reading comments and reviewing my answer, there are some things that can be clarified, some tweaks that will make it easier to use correctly, and one or two outright errors. Sorry about that; as docs go the original answer was "rough draft" quality. I'll address a couple questions first, but I do recommend having a look over the edited answer below as well.
Upstream Configuration - The relationships between branches in each repo are key to what's going on here. The fetch refspecs are going to govern that, and as long as they're set correctly no other "upstream" configuration should be required.
That said, the biggest change I make below is to move the cleaned-up branches in the bridge repository to their own `clean/*` namespace, so that fetching the right refs to the clean repository is much simpler.
BFG removing original branches - This is correct, but then after you configure the bridge repo's `origin` fetch refspec, the subsequent fetch will recreate the original branches under the `prod/*` namespace.
As to your last comment - I think your previous attempts are just falling victim to errors that arise from the "rough draft" problems with the original answer. Getting the right result is absolutely possible, and I guess as someone fully comfortable with the tools and techniques here I'm automatically looking past the "on-the-fly" corrections that would make it work. But hopefully this rewrite will at least get you closer to what you're trying to do...
You mention that you may have to merge changes from the production repo to the cleaned-up repo. That's not too bad a problem, but do beware that if you need to have changes flow in both directions - i.e. if you might want to update the production branch with changes from the cleaned-up repo - that complicates things and might favor a different approach.
Also, this is easiest if all changes flow from a single branch on the production repo into the clean repo. (It doesn't matter if you use branches within the production repo, but you'd ideally want them all to be merged into a single branch, which becomes the source of changes for a single branch in the clean repo.) If not, the same principles can apply, but the execution is harder.
Note that any approach is only as good as the ability to apply patches from production onto the cleaned code base. To the extent that the cleaning only consists of removing certain files, that's no problem. But if the repos diverge wildly, then conflicts when applying the changes will become an ever-increasing problem regardless of anything you might try.
For a one way flow (prod repo -> cleaned repo), you can keep one repo with both "original" and "cleaned" history. This can be the production repo itself, or a dedicated "bridge repository". (It cannot be the cleaned repository, as it would then contain the large history you're trying to remove from it.)
Exactly how to get to that state from where you are, depends on details of where you are. For illustrative purposes, if you started with this approach in mind it might go like this:
You have your prod repo at `<prod-url>`. You clone it, and this clone will be used to make a bridge repository.
$ git clone <prod-url> bridge
$ cd bridge
You run BFG in `bridge`, and then clone that to create the true "clean" repository. Then (once again in `bridge`) you re-configure `origin` so that its branches can be mapped to a `prod` namespace in the `bridge` repo.
$ git config remote.origin.fetch refs/heads/*:refs/heads/prod/*
Now, when you fetch from origin to the bridge repo, instead of updating remote tracking refs, git will try to advance a set of branches in the `prod/` namespace. But you do not want those `prod/*` branches fetched into your clean repo(s); the easiest way to fix that is to move the cleaned-up branches to a `clean/` namespace and reconfigure the clean repo to fetch only the `clean/*` branches.
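To see what that refspec does without touching a real repository, here is a minimal sketch on throwaway repos. The directory names and the single commit are made up for the demo; the scratch `prod` repo stands in for `<prod-url>`.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

# Hypothetical stand-in for the production repo at <prod-url>.
git init -q "$work/prod"
cd "$work/prod"
git symbolic-ref HEAD refs/heads/master
echo app > app.txt
git add app.txt
git commit -q -m "production commit"

# Clone it as the bridge and remap origin's branches into prod/*.
git clone -q "$work/prod" "$work/bridge"
cd "$work/bridge"
git config remote.origin.fetch 'refs/heads/*:refs/heads/prod/*'
git fetch -q origin

# Lists the clone's own master plus the freshly fetched prod/master.
git for-each-ref --format='%(refname)' refs/heads/
```

The fetched branch lands in `refs/heads/prod/master` as an ordinary local branch, which is what lets the later steps check it out and merge from it inside the bridge.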
In `bridge`, there are several ways to go about moving the branches. If there aren't many, you could do it manually:
$ git checkout master
$ git checkout -b clean/master
$ git branch -D master
For lots of branches, you could script this (perhaps using `git for-each-ref` to kick things off). Or you could perhaps abuse the `filter-branch` backup ref mechanism in some way.
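A rough sketch of the scripted variant, again on a scratch repo. The branch names `feature` and `prod/master` are invented for the demo, and the loop uses `git branch -m` (a rename) rather than the checkout/create/delete sequence above, which I'm assuming is acceptable since the end result is the same set of refs.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

# Scratch "bridge" repo with two cleaned branches and one prod/* branch.
git init -q "$work/bridge"
cd "$work/bridge"
git symbolic-ref HEAD refs/heads/master
echo app > app.txt
git add app.txt
git commit -q -m "cleaned commit"
git branch feature        # a second cleaned branch (hypothetical)
git branch prod/master    # pretend this one was already fetched from prod

# Snapshot the branch list first, then rename everything that is not
# already namespaced. Ref names cannot contain spaces or glob characters,
# so plain word splitting is safe here.
for ref in $(git for-each-ref --format='%(refname)' refs/heads/); do
  name=${ref#refs/heads/}
  case "$name" in
    prod/*|clean/*) continue ;;   # leave namespaced branches alone
  esac
  git branch -m "$name" "clean/$name"
done

git for-each-ref --format='%(refname:short)' refs/heads/
```

Renaming also moves `HEAD` along when the current branch is renamed, so there is no need to switch branches first.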
Anyway, once the branches are moved, go to the clean repo and
$ git config remote.origin.fetch +refs/heads/clean/*:refs/remotes/origin/*
Now taking a step back: unlike this last command, when I gave a fetch refspec for `origin` in the bridge repo, I omitted the leading `+` that is often used on fetch refspecs. That means that if a `prod` branch undergoes a history rewrite, the fetch will complain and you'll know you have a potential headache to resolve. More on that later.
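If you want to convince yourself of that behaviour first, this sketch rewrites history in a scratch `prod` repo (with `git commit --amend`, standing in for any rewrite) and then fetches into a bridge whose refspec has no leading `+`. All names are demo-only.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/prod"
cd "$work/prod"
git symbolic-ref HEAD refs/heads/master
echo one > app.txt
git add app.txt
git commit -q -m "original commit"

git clone -q "$work/prod" "$work/bridge"
cd "$work/bridge"
git config remote.origin.fetch 'refs/heads/*:refs/heads/prod/*'
git fetch -q origin
before=$(git rev-parse prod/master)

# Rewrite history in prod: the amended commit replaces the original one.
cd "$work/prod"
git commit -q --amend -m "rewritten commit"

# Without a leading + the bridge refuses the non-fast-forward update.
cd "$work/bridge"
if git fetch -q origin 2>/dev/null; then
  fetch_result=accepted
else
  fetch_result=rejected
fi
echo "fetch after rewrite: $fetch_result"
```

The `prod/master` branch in the bridge stays where it was, so nothing downstream silently picks up the rewritten history.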
So next, in the bridge repo you can run
$ git fetch origin
which will re-load the original branches under the `prod/` namespace.
Now you have both the original branches (e.g. `refs/heads/prod/master`) and the clean branches (e.g. `refs/heads/clean/master`). It could be drawn like this:
A' -- B' -- C' -- D' <--(clean/master)
A -- B -- C -- D <--(prod/master)
The histories are unrelated, and you need to keep it that way. But you also want to "know" that the `clean/master` branch is "up-to-date" through the `D` commit on `prod/master` in a way that makes merging future changes easy. One way is to create two additional branches - let's call them `bridge-prod` and `bridge-clean`.
The `bridge-clean` branch will stay pointed at the last commit on which we brought changes in from `prod`. New changes may go on in the `clean/` branches themselves, but `bridge-clean` will remember what a cleaned-up version of `prod` alone would look like.
$ git checkout clean/master
$ git branch bridge-clean
Then `bridge-prod`'s job is to have the same content as `bridge-clean`, until it receives new changes from `prod/master` - after which it will be used as a reference for updating `bridge-clean` once again.
So to initialize that, we create a copy of `D'` whose parent is `D`.
git checkout prod/master
git checkout -b bridge-prod
git rm -r ':/'
git checkout bridge-clean -- ':/'
git commit
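Here is the same initialization run on a scratch repo, followed by a check that it produced what we want. The orphan branch is only a stand-in for the BFG-cleaned history, and `app.txt` / `legacy.txt` are made-up files (one kept, one removed by the clean-up).

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/bridge"
cd "$work/bridge"
git symbolic-ref HEAD refs/heads/prod/master
echo app > app.txt
echo old > legacy.txt
git add app.txt legacy.txt
git commit -q -m "D: production state"

# D': unrelated root commit standing in for the cleaned history.
git checkout -q --orphan clean/master
git rm -q -f legacy.txt
git commit -q -m "D': cleaned state"
git branch bridge-clean

# D": the steps from the answer, with a commit message supplied inline.
git checkout -q prod/master
git checkout -q -b bridge-prod
git rm -r -q ':/'
git checkout bridge-clean -- ':/'
git commit -q -m 'D": cleaned content on top of prod history'

# Same tree as bridge-clean, but the parent is the production commit.
git rev-parse 'bridge-prod^{tree}' 'bridge-clean^{tree}'
git rev-parse 'bridge-prod^' prod/master
```

Each `rev-parse` call should print the same hash twice: the first pair shows the two branches have identical content, the second that `bridge-prod` hangs off `prod/master`.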
Now you have
A' -- B' -- C' -- D' <--(bridge-clean)(clean/master)
                 D" <--(bridge-prod)
                /
A -- B -- C -- D <--(prod/master)
where `D'` and `D"` have identical content (which is the "cleaned" version of `D`). Because `D"` has `D` as its parent, you can merge future changes from `prod/master` into `bridge-prod` (`D` will be the merge base). So after some time you have
                    ... x <--(clean/master)
                   /
A' -- B' -- C' -- D' <--(bridge-clean)
                 D" <--(bridge-prod)
                /
A -- B -- C -- D ... H <--(prod/master)
The two `...` could include many commits, branches, merges, whatever; it doesn't make a big difference. The important thing is that `bridge-prod` and `bridge-clean` still represent the last integration between the repos.
So next you want to merge `prod/master` to `bridge-prod`.
                    ... x <--(clean/master)
                   /
A' -- B' -- C' -- D' <--(bridge-clean)
                 D" -- H" <--(bridge-prod)
                /     /
A -- B -- C -- D ... H <--(prod/master)
You want `H"` to represent the cleaned-up state of `H`. For that, there are two conditions to worry about:
If the `prod/master` branch updates a file that was removed by the clean-up, then the merge will conflict. Luckily these removals are the only changes on "our" side of the merge, and we know we want to keep them over whatever `prod/master` might have done to those files. So when we merge we could say
git checkout bridge-prod
git merge -X ours prod/master
The `-X ours` option should not be confused with `-s ours`. While `-s ours` would use the "ours merge strategy", ignoring the `prod/master` changes entirely, `-X ours` uses the default merge strategy with the "ours strategy option" (thanks, git, for the clear-as-mud naming).
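What this means is that the command will try to merge as normal, but whenever a hunk conflicts, the `bridge-prod` version of that hunk will prevail. Be aware that this only settles content conflicts inside files: when `prod/master` modifies a file that `bridge-prod` has deleted, git still reports a modify/delete conflict and stops, and you resolve it by removing the file again with `git rm` before committing. Since the only changes on `bridge-prod` are removals of files we don't want, deleting is always the right resolution.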
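One thing to check on your git version: as far as I know, `-X ours` only picks our side for conflicting hunks, so a file that was deleted on `bridge-prod` and modified on `prod/master` still shows up as an unmerged "deleted by us" path. The sketch below, on a scratch repo with made-up files, tolerates the stopped merge, removes those paths again and then concludes the merge.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/bridge"
cd "$work/bridge"
git symbolic-ref HEAD refs/heads/prod/master
printf 'app v1\n' > app.txt
printf 'legacy v1\n' > legacy.txt
git add app.txt legacy.txt
git commit -q -m "D"

# D": same as D minus the file the clean-up removed.
git checkout -q -b bridge-prod
git rm -q legacy.txt
git commit -q -m 'D": cleaned content'

# H: production keeps changing both files.
git checkout -q prod/master
printf 'app v2\n' > app.txt
printf 'legacy v2\n' > legacy.txt
git commit -q -a -m "H"

git checkout -q bridge-prod
# The merge may stop on a modify/delete conflict, so tolerate its status.
git merge -q -X ours --no-commit prod/master || true
# "DU" in porcelain status marks paths deleted by us that are still unmerged.
git status --porcelain | sed -n 's/^DU //p' |
while IFS= read -r path; do
  git rm -q -- "$path"
done
git commit -q --no-edit

git ls-files
```

After this, `bridge-prod` carries the production change to `app.txt` while `legacy.txt` stays gone. Any other kind of leftover conflict would make the final `git commit` fail, which is the behaviour you want from a script like this.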
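The other problem would be if `prod/master` might have added a new file that should be excluded from the clean-up. If you know that can't happen, no problem. If it could happen, then you need to check for it. For example, before merging you could list the files added on `prod/master` since the last integration (the three dots compare against the merge base) with
git diff --name-status --diff-filter=A bridge-prod...prod/master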
and see if there are any new files that you wouldn't want in the clean repo. If so, then for your merge do
git checkout bridge-prod
git merge -X ours --no-commit prod/master
# remove the unwanted files
git add ':/'
git commit
Now, because `D"` is the same content as `D'`, that means that `H"` has the TREE you want in the next `bridge-clean` commit.
git checkout bridge-clean
git rm -r ':/'
git checkout bridge-prod -- ':/'
git commit
This gives you
                    ... x <--(clean/master)
                   /
A' -- B' -- C' -- D' -- H' <--(bridge-clean)
                 D" -- H" <--(bridge-prod)
                /     /
A -- B -- C -- D ... H <--(prod/master)
`H'` has the same content as `H"`, which is the sanitized content, updated through `H`. Also, `H'` has sanitized history (its parent is `D'`, which we cleaned up at the outset), so it can safely be included in the clean repo. You can merge `bridge-clean` into `clean/master` and the change transfer is complete.
This is conceptually a bit involved, and takes some up-front setup (and maybe writing a few scripts to use with each integration of changes). But once that's all set up, it minimizes the manual fiddling and lets you make the best applicable use of the merge machinery git provides.
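As a starting point for such a script, here is a sketch of one integration round wrapped in a function, exercised on a scratch repo. `integrate_prod` is a name I made up, the files are demo data, and it assumes you have already run `git fetch origin` in the bridge and that no newly added production files need excluding. Paths deleted on our side but modified in production are removed again explicitly, since `-X ours` may leave those unmerged.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Hypothetical helper: run inside the bridge repo after `git fetch origin`.
integrate_prod() {
  # Nothing to do when bridge-prod already contains prod/master.
  if git merge-base --is-ancestor prod/master bridge-prod; then
    return 0
  fi
  # 1. Merge production into bridge-prod, keeping the clean-up deletions.
  git checkout -q bridge-prod
  git merge -q -X ours --no-commit prod/master || true
  git status --porcelain | sed -n 's/^DU //p' |
  while IFS= read -r path; do git rm -q -- "$path"; done
  git commit -q --no-edit
  # 2. Copy that tree onto bridge-clean, which has the clean ancestry.
  git checkout -q bridge-clean
  git rm -r -q ':/'
  git checkout bridge-prod -- ':/'
  if ! git diff --cached --quiet; then
    git commit -q -m "Integrate prod/master at $(git rev-parse --short prod/master)"
  fi
  # 3. Bring the result into the branch people actually work on.
  git checkout -q clean/master
  git merge -q --no-edit bridge-clean
}

# Scratch demo: build the D / D' / D" / x / H picture, then integrate.
work=$(mktemp -d)
git init -q "$work/bridge"
cd "$work/bridge"
git symbolic-ref HEAD refs/heads/prod/master
printf 'app v1\n' > app.txt
printf 'legacy v1\n' > legacy.txt
git add app.txt legacy.txt
git commit -q -m "D"
git checkout -q --orphan clean/master    # D': stand-in for cleaned history
git rm -q -f legacy.txt
git commit -q -m "D'"
git branch bridge-clean
git checkout -q prod/master              # D": cleaned content on prod history
git checkout -q -b bridge-prod
git rm -q legacy.txt
git commit -q -m 'D"'
git checkout -q clean/master             # x: work only on the clean side
printf 'new\n' > feature.txt
git add feature.txt
git commit -q -m "x"
git checkout -q prod/master              # H: new production work
printf 'app v2\n' > app.txt
printf 'legacy v2\n' > legacy.txt
git commit -q -a -m "H"

integrate_prod
git ls-files
```

In the demo, `clean/master` ends up with the production change to `app.txt` plus its own `feature.txt`, and `legacy.txt` never enters its history. A real script would also want to stop and ask for help when step 1 or step 3 hits a conflict it cannot classify.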
However, it's a one-way bridge. If you were to merge `bridge-prod` back into `prod/master`, you would almost certainly delete files that you want kept in `prod/master`.
If you do have to take changes from the clean repo and apply them to the prod repo, you could generate a patch on the clean repo. To the extent that the clean repo content is a subset of the prod repo content, the patch should apply without too much hassle. It might cause some spurious conflicts the next time you merge changes from prod down to clean.
One last additional point (mentioned above but then forgotten) - This all assumes that you won't be doing history rewrites in the `prod` repo going forward (or at least not often). If you were to do such a rewrite, then just as another user's clone couldn't cleanly pull changes, the bridge wouldn't work normally for integrating the change into the clean repo. You'd have to work out a procedure based on the specifics of the situation.
Upvotes: 2