GreyCat

Reputation: 17104

Using git repository as a database backend

I'm doing a project that deals with a structured document database. I have a tree of categories (~1000 categories, up to ~50 categories on each level), and each category contains several thousand (up to, say, ~10,000) structured documents. Each document is several kilobytes of data in some structured form (I'd prefer YAML, but it may just as well be JSON or XML).

Users of this system perform several types of operations:

Of course, the traditional solution would be to use some sort of document database (such as CouchDB or Mongo) for this problem; however, this version control (history) thing tempted me towards a wild idea: why shouldn't I use a git repository as the database backend for this application?

At first glance, it could be solved like this:

Are there any other common pitfalls in this solution? Has anyone tried to implement such a backend already (e.g. for any popular frameworks: RoR, Node.js, Django, CakePHP)? Does this solution have any possible implications for performance or reliability, i.e. is it proven that git would be much slower than traditional database solutions, or would there be scalability/reliability pitfalls? I presume that a cluster of such servers that push/pull each other's repositories should be fairly robust and reliable.

Basically, tell me if this solution will work, and why it will or won't.
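
For concreteness, a rough, purely illustrative sketch of that mapping (paths, file names and commit messages are placeholders, not an existing implementation): a category is a directory, a document is a file, an edit is a commit, and history/revert come from git itself:

# assumed layout: one directory per category, one YAML file per document
$ git init docs-db && cd docs-db

# "create document" = add a file and record it as one transaction with a comment
$ mkdir -p categories/physics
$ $EDITOR categories/physics/doc-0001.yaml
$ git add categories/physics/doc-0001.yaml
$ git commit -m "alice: create doc-0001 (initial draft)"

# "view history of a document" = git log on its path
$ git log --follow -- categories/physics/doc-0001.yaml

# "revert an edit" = revert the corresponding commit
$ git revert <hash_of_the_bad_edit>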

Upvotes: 164

Views: 41532

Answers (7)

VonC

Reputation: 1323025

The backend of Git itself is text-based, which is not suited for SQL queries.

That might change someday, as illustrated with Git 2.44 (Q1 2024), which introduces a new extension "refstorage" so that we can mark a repository that uses a non-default ref backend, like reftable.

See commit 1b22340, commit 5ed860f, commit 48fa45f, commit 3c4a531, commit 58aaf59, commit aa19619, commit d7497a4, commit 58be32f, commit 173761e, commit 0fcc285, commit 465a22b, commit bb0372c (29 Dec 2023) by Patrick Steinhardt (pks-t).
See commit 7a75e13 (20 Dec 2023) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 32c6fc3, 16 Jan 2024)

builtin/clone: introduce --ref-format= value flag

Signed-off-by: Patrick Steinhardt

Introduce a new --ref-format value flag for git-clone(1) that allows the user to specify the ref format that is to be used for a newly initialized repository.

git clone now includes in its man page:

--ref-format=<ref-format>

Specify the given ref storage format for the repository. The valid values are:

  • files for loose files with packed-refs. This is the default.
  • reftable for the reftable format. This format is experimental and its internals are subject to change.
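
For example, assuming a Git build with reftable support (2.44 or later), a clone could opt into the new backend like this (the URL is a placeholder):

# clone a repository using the reftable ref backend instead of loose/packed refs
$ git clone --ref-format=reftable https://example.com/docs-db.git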

Meanwhile (2023/2024), consider the project dolthub/dolt:

Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository.

Connect to Dolt just like any MySQL database to read or modify schema and data. Version control functionality is exposed in SQL via system tables, functions, and procedures.

Or, use the Git-like command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate's changes. All the commands you know for Git work exactly the same for Dolt.

Git versions files. Dolt versions tables. It's like Git and MySQL had a baby.
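
A hedged illustration of that workflow, based on Dolt's git-style CLI (the table, values and branch names are made up):

# create a versioned SQL database and commit an initial schema
$ dolt init
$ dolt sql -q "CREATE TABLE docs (id INT PRIMARY KEY, category VARCHAR(100), body JSON);"
$ dolt add .
$ dolt commit -m "add docs table"

# branch, edit and merge data just like source code
$ dolt checkout -b edit-doc-42
$ dolt sql -q "INSERT INTO docs VALUES (42, 'physics', '{\"title\": \"example\"}');"
$ dolt add .
$ dolt commit -m "add document 42"
$ dolt checkout main
$ dolt merge edit-doc-42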


But, with that new backend, "git init"(man) in an already created directory, when the user configuration has includeIf.onbranch, started to fail recently; this has been corrected with Git 2.46 (Q3 2024), batch 10.

See commit 407997c (22 May 2024) by Patrick Steinhardt (pks-t).
(Merged by Junio C Hamano -- gitster -- in commit 5c7c063, 03 Jun 2024)

setup: fix bug with "includeIf.onbranch" when initializing dir

Reported-by: Heghedus Razvan
Signed-off-by: Patrick Steinhardt
Tested-by: Heghedus Razvan

It was reported that git init can fail when initializing an existing directory in case the config contains an "includeIf.onbranch:" condition:

$ mkdir repo
$ git -c includeIf.onbranch:main.path=nonexistent init repo
BUG: refs.c:2056: reference backend is unknown

The same error can also be triggered when re-initializing an already existing repository.

The bug has been introduced in 173761e ("setup: start tracking ref storage format", 2023-12-29, Git v2.44.0-rc0 -- merge listed in batch #8), which wired up the ref storage format.
The root cause is in init_db(), which tries to read the config before we have initialized the_repository and most importantly its ref storage format.
We eventually end up calling include_by_branch() and execute refs_resolve_ref_unsafe(), but because we have not initialized the ref storage format yet this will trigger the above bug.

Interestingly, include_by_branch() has a mechanism that will only cause us to resolve the ref when the_repository->gitdir is set.
This is also the reason why this only happens when we initialize an already existing directory or repository: gitdir is set in those cases, but not when creating a new directory.

Rearrange the code such that we set up the repository format before reading the config.
This fixes the bug and ensures that "onbranch:" conditions can trigger.


With Git 2.46 (Q3 2024), batch 14, a new command has been added to migrate a repository that uses the files backend for its ref storage to use the reftable backend, with limitations.

See commit 25a0023, commit 6d6a3a9, commit 64a6dd8, commit 1339cb3, commit b5d7db9, commit b3e098d, commit 120b671, commit 66275a6, commit fbd1a69, commit 6e1683a, commit 318efb9, commit a83f7f5 (06 Jun 2024) by Patrick Steinhardt (pks-t).
See commit 0ff6d23, commit e55f364 (23 May 2024) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 40a163f, 17 Jun 2024)

builtin/refs: new command to migrate ref storage formats

Signed-off-by: Patrick Steinhardt

Introduce a new command that allows the user to migrate a repository between ref storage formats.
This new command is implemented as part of a new git-refs(1) executable.
This is due to two reasons:

  • There is no good place to put the migration logic in existing commands.
    git-maintenance(1) felt unwieldy, and git-pack-refs(1) is not the correct place to put it, either.
  • I had it in my mind to create a new low-level command for accessing refs for quite a while already.
    git-refs(1) is that command and can over time grow more functionality relating to refs.
    This should help discoverability by consolidating low-level access to refs into a single executable.

As mentioned in the preceding commit that introduces the ref storage format migration logic, the new git refs migrate(man) command still has a bunch of restrictions.
These restrictions are documented accordingly.

git refs now includes in its man page:

git-refs(1)

git-refs - Low-level access to refs

git refs migrate --ref-format=<format> [--dry-run]

This command provides low-level access to refs.

migrate

Migrate ref store between different formats.

The following options are specific to 'git refs migrate':

  • --ref-format=<format> The ref format to migrate the ref store to. Can be one of:

    • files for loose files with packed-refs. This is the default.
    • reftable for the reftable format. This format is experimental and its internals are subject to change.
  • --dry-run Perform the migration, but do not modify the repository.
    The migrated refs will be written into a separate directory that can be inspected separately.
    The name of the directory will be reported on stdout.

    This can be used to double check that the migration works as expected before performing the actual migration.

KNOWN LIMITATIONS

The ref format migration has several known limitations in its current form:

  • It is not possible to migrate repositories that have reflogs.

  • It is not possible to migrate repositories that have worktrees.

  • There is no way to block concurrent writes to the repository during an ongoing migration. Concurrent writes can lead to an inconsistent migrated state. Users are expected to block writes on a higher level. If your repository is registered for scheduled maintenance, it is recommended to unregister it first with git-maintenance.

These limitations may eventually be lifted.
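
A hedged usage sketch, assuming Git 2.46+ and a repository that satisfies the limitations above (no worktrees and, before Git 2.48, no reflogs):

# dry run first: migrated refs are written to a separate directory whose name is printed
$ git refs migrate --ref-format=reftable --dry-run

# perform the actual migration to the reftable backend
$ git refs migrate --ref-format=reftable

# verify the ref storage format now in use (Git 2.45+)
$ git rev-parse --show-ref-format
reftable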


With Git 2.48 (Q1 2025), pre 2.48-rc1 (bis), "git refs migrate"(man) learned to also migrate the reflog data across backends.

See commit 8ddcdc1 (20 Dec 2024), and commit 246cebe, commit 297c09e, commit 84675fa, commit 4483be3, commit add2c4f, commit 611986f, commit a3582e2, commit 1a83e26 (16 Dec 2024) by Karthik Nayak (KarthikNayak).
See commit df5d7a7 (15 Dec 2024) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 6f8ae95, 23 Dec 2024)

refs: add support for migrating reflogs

Helped-by: Patrick Steinhardt
Signed-off-by: Karthik Nayak

The git refs migrate(man) command was introduced in 25a0023 ("builtin/refs: new command to migrate ref storage formats", 2024-06-06, Git v2.46.0-rc0 -- merge listed in batch #14) to support migrating from one reference backend to another.

One limitation of the command was that it didn't support migrating repositories which contained reflogs.
A previous commit added support for adding reflog updates in ref transactions.
Using that functionality, bake in reflog support for git refs migrate.

To ensure that the order of the reflogs is maintained during the migration, we add the index for each reflog update as we iterate over the reflogs from the old reference backend.
This is to ensure that the order is maintained in the new backend.

Upvotes: 2

JLG

Reputation: 1

Do you know VMDS? It was introduced in 1990.

Upvotes: 0

ioquatix

Reputation: 1476

I implemented a Ruby library on top of libgit2 which makes this pretty easy to implement and explore. There are some obvious limitations, but it's also a pretty liberating system since you get the full git toolchain.

The documentation includes some ideas about performance, tradeoffs, etc.

Upvotes: 5

7mp

Reputation: 129

As you mentioned, the multi-user case is a bit trickier to handle. One possible solution would be to use user-specific Git index files, resulting in:

  • no need for separate working copies (disk usage is restricted to changed files)
  • no need for time-consuming preparatory work (per user session)

The trick is to combine Git's GIT_INDEX_FILE environment variable with the tools to create Git commits manually:

A solution outline follows (actual SHA1 hashes omitted from the commands):

# Initialize the index
# N.B. Use the commit hash since refs might change during the session.
$ GIT_INDEX_FILE=user_index_file git reset --hard <starting_commit_hash>

#
# Change data and save it to `changed_file`
#

# Save changed data to the Git object database. Returns a SHA1 hash to the blob.
$ cat changed_file | git hash-object -t blob -w --stdin
da39a3ee5e6b4b0d3255bfef95601890afd80709

# Add the changed file (using the object hash) to the user-specific index
# N.B. When adding new files, --add is required
$ GIT_INDEX_FILE=user_index_file git update-index --cacheinfo 100644 <changed_data_hash> path/to/the/changed_file

# Write the index to the object db. Returns a SHA1 hash to the tree object
$ GIT_INDEX_FILE=user_index_file git write-tree
8ea32f8432d9d4fa9f9b2b602ec7ee6c90aa2d53

# Create a commit from the tree. Returns a SHA1 hash to the commit object
# N.B. The parent commit should be the same commit as in the first phase.
$ echo "User X updated their data" | git commit-tree <new_tree_hash> -p <starting_commit_hash>
3f8c225835e64314f5da40e6a568ff894886b952

# Create a ref to the new commit
$ git update-ref refs/heads/users/user_x_change_y <new_commit_hash>

Depending on your data, you could use a cron job to merge the new refs to master, but conflict resolution is arguably the hardest part here.
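
A minimal sketch of such a cron job, assuming the per-user refs live under refs/heads/users/ as above; branches that conflict are skipped and left for manual resolution:

#!/bin/sh
# merge-user-refs.sh - run periodically from cron inside a working copy of the repo
git checkout master

for ref in $(git for-each-ref --format='%(refname)' refs/heads/users/); do
    if git merge --no-edit "$ref"; then
        # merged cleanly: the per-user ref is no longer needed
        git update-ref -d "$ref"
    else
        # conflict: undo the merge and leave this ref for manual resolution
        git merge --abort
        echo "conflict while merging $ref; skipping" >&2
    fi
done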

Ideas to make it easier are welcome.

Upvotes: 5

young chisango

Reputation: 103

My 2 pence worth; a bit long, but... I had a similar requirement in one of my incubation projects. Similar to yours, my key requirements were a document database (XML in my case) with document versioning. It was for a multi-user system with a lot of collaboration use cases. My preference was to use available open-source solutions that support most of the key requirements.

To cut to the chase, I could not find one product that provided both in a way that was scalable enough (number of users, usage volumes, storage and compute resources). I was biased towards git for all its promising capability, and the (probable) solutions one could craft out of it. As I toyed with the git option more, moving from a single-user perspective to a multi- (milli-) user perspective became an obvious challenge. Unfortunately, I did not get to do substantial performance analysis like you did (...lazy / quit early... "save it for version 2" mantra). Power to you! Anyway, my biased idea has since morphed into the next (still biased) alternative: a mash-up of tools that are each the best in their separate spheres, databases and version control.

While still a work in progress (...and slightly neglected), the morphed version is simply this:

  • on the frontend (user-facing): use a database for first-level storage (interfacing with user applications)
  • on the backend: use a version control system (VCS) (like git) to version the data objects in the database

In essence, it would amount to adding a version control plugin to the database, with some integration glue, which you may have to develop, but which may be a lot easier.

How it would (be supposed to) work is that the primary multi-user data exchanges go through the database. The DBMS will handle all the fun and complex issues such as multi-user access, concurrency, atomic operations, etc. On the backend, the VCS performs version control on a single set of data objects (no concurrency or multi-user issues). For each effective transaction on the database, version control is performed only on the data records that have effectively changed.

As for the interfacing glue, it will be in the form of a simple interworking function between the database and the VCS. In terms of design, a simple approach would be an event-driven interface, with data updates from the database triggering the version control procedures (hint: assuming MySQL, use of triggers and sys_exec(), blah blah...). In terms of implementation complexity, it can range from the simple and effective (e.g. scripting) to the complex and wonderful (some programmed connector interface). It all depends on how crazy you want to go with it, and how much sweat capital you are willing to spend. I reckon simple scripting should do the magic. And to access the end result, the various data versions, a simple alternative is to populate a clone of the database (more a clone of the database structure) with the data referenced by the version tag/id/hash in the VCS; again, this bit will be a simple query/translate/map job for an interface.
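
A hedged sketch of that glue as a plain script (assumptions: MySQL with the sys_exec() UDF or a cron sweep to invoke it, one exported file per record, a git repository at /var/lib/docs-repo, and a database called mydb; all names are made up):

#!/bin/sh
# version-record.sh <table> <record_id>
# Invoked by the database layer after an effective transaction; exports the
# changed record and commits it to the VCS as one version.
TABLE="$1"
ID="$2"
REPO=/var/lib/docs-repo

# export the single changed record as XML (mysql's -X flag emits XML)
mysql -X -e "SELECT * FROM $TABLE WHERE id = $ID" mydb > "$REPO/$TABLE/$ID.xml"

# record the change; the commit is a no-op if nothing actually changed
cd "$REPO" || exit 1
git add "$TABLE/$ID.xml"
git commit --quiet -m "db: $TABLE/$ID updated" || true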

There are still some challenges and unknowns to be dealt with, but I suppose the impact and relevance of most of these will largely depend on your application requirements and use cases; some may just end up being non-issues. The issues include performance matching between the two key modules (the database and the VCS) for an application with high-frequency data-update activity, and scaling of resources (storage and processing power) on the git side over time as the data and the users grow: steadily, exponentially, or eventually plateauing.

Of the cocktail above, here is what I'm currently brewing:

  • using Git for the VCS (I initially considered good old CVS due to its use of only changesets, or deltas, between two versions)
  • using MySQL (due to the highly structured nature of my data: XML with strict XML schemas)
  • toying around with MongoDB (to try a NoSQL database, which closely matches the native object storage used in git)

Some fun facts: git actually does clever things to optimize storage, such as compression and storing only deltas between revisions of objects. Yes, git does store only changesets, or deltas, between revisions of data objects where applicable (it knows when and how); reference: packfiles, deep in the guts of Git internals. A review of git's object storage (a content-addressable filesystem) shows striking similarities, from a conceptual perspective, with NoSQL databases such as MongoDB. Again, at the expense of sweat capital, that may provide more interesting possibilities for integrating the two and for performance tweaking.

If you got this far, let me know whether the above might be applicable to your case and, assuming it would be, how it would square up with some of the aspects of your last comprehensive performance analysis.

Upvotes: 15

GreyCat

Reputation: 17104

Answering my own question is not the best thing to do but, as I ultimately dropped the idea, I'd like to share the rationale that worked in my case. I'd like to emphasize that this rationale might not apply to all cases, so it's up to the architect to decide.

Generally, the first main point my question misses is that I'm dealing with a multi-user system in which users work in parallel, concurrently, using my server with a thin client (i.e. just a web browser). This way, I have to maintain state for all of them. There are several approaches to this, but all of them are either too hard on resources or too complex to implement (and thus kind of kill the original purpose of offloading all the hard implementation stuff to git in the first place):

  • "Blunt" approach: 1 user = 1 state = 1 full working copy of a repository that server maintains for user. Even if we're talking about fairly small document database (for example, 100s MiBs) with ~100K of users, maintaining full repository clone for all of them makes disc usage run through the roof (i.e. 100K of users times 100MiB ~ 10 TiB). What's even worse, cloning 100 MiB repository each time takes several seconds of time, even if done in fairly effective maneer (i.e. not using by git and unpacking-repacking stuff), which is non acceptable, IMO. And even worse — every edit that we apply to a main tree should be pulled to every user's repository, which is (1) resource hog, (2) might lead to unresolved edit conflicts in general case.

    Basically, it might be as bad as O(number of edits × data × number of users) in terms of disc usage, and such disc usage automatically means pretty high CPU usage.

  • "Only active users" approach: maintain working copy only for active users. This way, you generally store not a full-repo-clone-per-user, but:

    • As user logs in, you clone the repository. It takes several seconds and ~100 MiB of disc space per active user.
    • As user continues to work on the site, he works with the given working copy.
    • As user logs out, his repository clone is copied back to main repository as a branch, thus storing only his "unapplied changes", if there are any, which is fairly space-efficient.

    Thus, disc usage in this case peaks at O(number of edits × data × number of active users), which is usually ~100..1000 times less than number of total users, but it makes logging in/out more complicated and slower, as it involves cloning of a per-user branch on every login and pulling these changes back on logout or session expiration (which should be done transactionally => adds another layer of complexity). In absolute numbers, it drops 10 TiBs of disc usage down to 10..100 GiBs in my case, that might be acceptable, but, yet again, we're now talking about fairly small database of 100 MiBs.

  • "Sparse checkout" approach: making "sparse checkout" instead of full-blown repo clone per active user doesn't help a lot. It might save ~10x of disc space usage, but at expense of much higher CPU/disc load on history-involving operations, which kind of kills the purpose.

  • "Workers pool" approach: instead of doing full-blown clones every time for active person, we might keep a pool of "worker" clones, ready to be used. This way, every time a users logs in, he occupies one "worker", pulling there his branch from main repo, and, as he logs out, he frees the "worker", which does clever git hard reset to become yet again just a main repo clone, ready to be used by another user logging in. Does not help much with disc usage (it's still pretty high — only full clone per active user), but at least it makes logging in/out faster, as expense of even more complexity.

That said, note that I intentionally calculated numbers for a fairly small database and user base: 100K users, 1K active users, 100 MiB total database + history of edits, 10 MiB of working copy. If you look at more prominent crowd-sourcing projects, the numbers there are much higher:

│              │ Users │ Active users │ DB+edits │ DB only │
├──────────────┼───────┼──────────────┼──────────┼─────────┤
│ MusicBrainz  │  1.2M │     1K/week  │   30 GiB │  20 GiB │
│ en.wikipedia │ 21.5M │   133K/month │    3 TiB │  44 GiB │
│ OSM          │  1.7M │    21K/month │  726 GiB │ 480 GiB │

Obviously, for those amounts of data/activity, this approach would be utterly unacceptable.

Generally, it would have worked if one could use the web browser as a "thick" client, i.e. issuing git operations and storing pretty much the full checkout on the client's side, not on the server's side.

There are also other points that I've missed, but they're not that bad compared to the first one:

  • The very pattern of having a "thick" user edit state is controversial in terms of normal ORMs, such as ActiveRecord, Hibernate, DataMapper, Tower, etc.
  • As much as I've searched, there is zero existing free codebase for doing this approach to git from popular frameworks.
  • There is at least one service that somehow manages to do that efficiently (that is obviously GitHub), but, alas, their codebase is closed source, and I strongly suspect that they do not use normal git servers / repo storage techniques inside, i.e. they basically implemented an alternative "big data" git.

So, bottom line: it is possible, but for most current use cases it won't be anywhere near the optimal solution. Rolling your own document-edit-history-to-SQL implementation or trying to use an existing document database would probably be a better alternative.

Upvotes: 75

Kombajn zbożowy

Reputation: 10693

An interesting approach indeed. I would say that if you need to store data, use a database, not a source code repository, which is designed for a very specific task. If you could use Git out of the box, then fine, but you would probably need to build a document repository layer over it. So you could build it over a traditional database as well, right? And if it's built-in version control that you're interested in, why not just use one of the open-source document repository tools? There are plenty to choose from.

Well, if you decide to go for the Git backend anyway, then it would basically work for your requirements if you implemented it as described. But:

1) You mentioned a "cluster of servers that push/pull each other" - I've thought about it for a while and I'm still not sure. You can't push/pull several repos as an atomic operation. I wonder if there could be a possibility of some merge mess during concurrent work.

2) Maybe you don't need it, but an obvious piece of document repository functionality you did not list is access control. You could possibly restrict access to some paths (= categories) via submodules, but you probably won't be able to grant access at the document level easily.

Upvotes: 12
