Adam Rodger

Reputation: 3562

Split a Subversion Repository Project to Two Git Repositories

I have a Subversion server with a few different projects in the standard layout like so:

ProjectA/
    trunk/
    branches/
    tags/
ProjectB/
    trunk/
        FolderOfBinaries/
        SourceFolderA/
        SourceFolderB/
        SourceFolderC/
    branches/
    tags/
        v1.0/
        v1.1/
        v2.0/
ProjectC/
    trunk/
    branches/
    tags/

ProjectB is going to be migrated to Git, but not with a standard clone. I want to split the project into two Git repositories: one for the folder full of large binaries that change relatively often, and another for everything else. I did a full clone of the repository and it's a few GB, but the binaries folder is probably 90% of that, and running git gc takes a long time. I'd rather have a small, fast repository and then add the binaries folder as a submodule if the developer requires it.

I've found two potential options so far. First, I could use git filter-branch to try to remove the folder of binaries from the history, as shown in the Git Book. Second, I could use svndumpfilter to split the current Subversion repository into two and then git svn clone each separately.

My question, though, is what will happen to all the history, and particularly the branches and tags? I'd still like to know what the folder of binaries looked like at every tag in the project, even though the binaries may not have changed between two tags. Is that possible?

Edit: The folder of binaries is not full of build artefacts (*.class, *.o, *.dll etc) so I can't just strip it out and make them external. It's full of binaries that are output from a third-party program that need to be versioned (think OpenOffice documents, Photoshop files etc.).

Upvotes: 4

Views: 2517

Answers (3)

Adam Rodger

Reputation: 3562

Well, I've managed to do this, but it wasn't all that straightforward. There may be a better way but not one that I could work out. I did the following:

  1. Create a dump of the current repository: svnadmin dump /opt/repo > full_dump

  2. Filter the dump to remove the binaries folder: svndumpfilter exclude --pattern "*folderofbinaries*" --renumber-revs --drop-empty-revs < full_dump > filtered_dump. The pattern is quoted so that the shell does not expand it. I needed to make folderofbinaries a pattern because way back in the past someone had actually checked a binary directly into a tag (!), so the next step was failing due to a missing folder.

  3. Create a local SVN repository with the filtered dump: mkdir repo-filtered; svnadmin create repo-filtered; svnadmin load repo-filtered < filtered_dump

  4. Clone both the full and filtered repo into different folders (I used svn2git). The filtered repo will not contain any of the binaries. If, in the full repo, only the binaries folder changed between tags A and B, in the new filtered Git repo the two tags will point to the same commit, which is exactly what I wanted.

  5. In the full Git repo, use Git to strip out everything except the binaries folder.

The reason that I had to use Git to isolate the binaries folder was because I couldn't work out how to maintain the tags just using svndumpfilter (especially given I had a binary committed directly into a tag). After the conversion I get the same behaviour as in the filtered repo - if no binaries changed between two tags then they both point to the same commit.

The commands for the final step were:

git checkout master
git filter-branch --tag-name-filter cat --prune-empty --subdirectory-filter folderofbinaries -- --all
git reset --hard
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now

which I got from this question.
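To check the behaviour described above (two tags sharing a commit when nothing relevant changed between them), something like the following sketch could be used in either converted repository. The function name tags_match and the tag names in the usage line are illustrative assumptions.

```shell
# Sketch: succeed (exit 0) when two tags resolve to the same commit.
# tags_match is a hypothetical helper name, not part of Git.
tags_match() {
    # ^{commit} peels annotated tags down to the commit they point at.
    first=$(git rev-parse --verify -q "$1^{commit}") || return 2
    second=$(git rev-parse --verify -q "$2^{commit}") || return 2
    [ "$first" = "$second" ]
}
```

For example, `tags_match v1.0 v1.1 && echo "same commit"` prints the message only when both tags point to one commit; a return status of 2 means one of the names could not be resolved.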

Now I have an 80MB sources repository and a 1.5GB binaries repository from my original 4.4GB SVN dump file! I can recreate the exact state of the original SVN repo by adding the binaries folder as a Git submodule of the sources repo and checking out the same tag on each (which is why I needed to preserve all the tag info) whilst not having one mammoth Git repo that's slow to work with.
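The recombination described above could be sketched roughly as follows. The submodule path folderofbinaries, the two function names, and the URL argument are illustrative assumptions, and the sketch assumes both repositories carry the same tag names.

```shell
# Sketch only: run inside the sources repository.
# add_binaries_submodule and checkout_tag_everywhere are hypothetical helpers.

# Register the binaries repository as a submodule and record it in a commit.
add_binaries_submodule() {
    binaries_url=$1   # URL or path of the binaries repository (assumption)
    git submodule add "$binaries_url" folderofbinaries &&
    git commit -q -m "Add binaries repository as a submodule"
}

# Check out the same tag in the sources repo and in the binaries checkout.
checkout_tag_everywhere() {
    tag=$1
    git checkout -q "$tag" &&
    git -C folderofbinaries checkout -q "$tag"
}
```

Tags created before the submodule commit do not record the submodule, which is why the second checkout is done explicitly rather than through git submodule update.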

Upvotes: 1

David W.

Reputation: 107030

Take a look at svndumpfilter. It's pretty simple to use. You do a Subversion repository dump, and then use the filter to either say what you want or what you don't want.

Do a dump of your current repository, then run svndumpfilter once for each Git repository you want to end up with. You can chain invocations; for example, this keeps only ProjectB and then drops the binaries folder:

$ svndumpfilter include ProjectB < svn_repo_dump | svndumpfilter exclude ProjectB/trunk/folderofbinaries > svn_repos_no_binaries

I do want to mention one thing: Don't store built binary objects in your repository. In Subversion, they're impossible to remove without a dump and filter, and even in version control systems with the ability to obliterate revisions, doing so takes a lot of time and effort. It's a big maintenance headache.

And for what? Storing binaries in a version control system doesn't really help. You can't diff binaries, the history doesn't help, and they are hard for non-developers to access.

Instead, use a release repository and store your binaries there. You can use a Maven repository manager like Artifactory or Nexus even if you don't use Maven or Java.

Upvotes: 1

Stefan Ferstl

Reputation: 5255

I would recommend svndumpfilter to first split ProjectB into two repositories. Afterwards you can use git svn clone to convert the new SVN repositories into Git repositories. As long as the paths you pass to svndumpfilter include cover the trunk, branches, and tags folders, the full history of the split repositories will be preserved. So you can still look at all the history of FolderOfBinaries in the new binaries repository.

When you create the Git repositories using git svn clone with the standard layout, the content of the branches folder is imported as Git branches and the content of the tags folder as tag refs. Note that git svn itself records SVN tags as remote branches under tags/, so a follow-up step (or a wrapper such as svn2git) is needed to turn them into native Git tags.
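As a minimal sketch of handling tags after git svn clone: with recent Git versions the SVN tags usually end up as remote-tracking refs under refs/remotes/origin/tags/ (that default prefix is an assumption about your setup, and older versions used refs/remotes/tags/). The helper name promote_svn_tags is hypothetical.

```shell
# Sketch: turn the remote-tracking refs that git svn creates for SVN tags
# into lightweight Git tags. promote_svn_tags is a hypothetical helper.
promote_svn_tags() {
    # Optional first argument overrides the assumed ref prefix.
    prefix=${1:-refs/remotes/origin/tags}
    git for-each-ref --format='%(refname)' "$prefix" |
    while read -r ref; do
        name=${ref#"$prefix"/}        # e.g. v1.0
        git tag "$name" "$ref" &&     # create a tag at the same commit
        git update-ref -d "$ref"      # then drop the remote-tracking ref
    done
}
```

Run it from inside the converted repository, for example `promote_svn_tags` or `promote_svn_tags refs/remotes/tags` for a clone made without the origin/ prefix.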

Upvotes: 1
