Reputation: 3996

How do you implement version control in a database application?

I'm working on a web based Java project that stores end user data in a MySql database. I'd like to implement something that allows the user to have functionality similar to what I have for my source code version control (e.g. Subversion). In other words, I'd like to implement code that allows the user to commit and rollback work and return to an existing branch. Is there an existing framework for this? It seems like putting the database data into version control and exposing the version control functionality to the end user (i.e. write code that allows the user to commit, rollback, etc.) could be a reasonable approach but it also seems their might be some problems with this approach. For example, how would you allow one user to view a rolled back version of the data (i.e. you can't just replace the data the database is pointing to if one user wants to look at a rolled back version of the data)? If given the choice of completely rebuilding the system using any persistence architecture what could be used to store the data that would make this type of functionality easy to implement?

Upvotes: 2

Answers (4)

baykam

Reputation: 11

It seems like you want version control for your data rather than the database schema. I could find two databases that implement most of the version control features such as fork, clone, branch, merge, push, and pull:

https://github.com/dolthub/dolt - SQL based
https://github.com/terminusdb/terminusdb - graph based

Upvotes: 1

yorammi

Reputation: 6458

There are 2 very common solutions for what you need:

Upvotes: 3

Lucas

Reputation: 258

Branching and merging the user data

Your question is about solutions to version the user data in a application, to give your users capabilities such as branching and merging. You pondered about exposing a real version control such as svn.

The side-effects I can foresee are:

You will have to index things by directory and filename. Maybe using an abstraction of directories as entities and filenames as the primary key.
Operating systems (linux, mac and windows alike) does not handle well directories with millions of files. You will have to partition the entity. Usually hashing the ID (md5 for example) and taking the beginning of the hash to create an subdirectory. The number of digits to take from the hash depends on the expected size of the entity.
Operating systems (linux, mac and windows alike) are not prepared for huge quantity of files. I did a test on that. It took me days to backup and finally remove an file tree with hundreds of millions of files.
You will not be able to have additional indexes beyond the primary key, however you can work around that creating a data-mart, as I will describe below.
You will not have database constraints, but similar functionality can be implemented through git/svn/cvs triggers.
You will not have strong transactions, but similar functionality can be implemented through git/svn/cvs triggers.
You will have a working copy for each user, this will consume space depending on the size of the repositories. That way each user will be in a single point in time.
GIT is fast enough to switch from a branch to another, so go back in time and back will take only seconds (unless the user data is big, of course).
I saw a Linus interview where he warned about low performance in huge git repositories. Maybe it is best to have a repository to each user or other means to avoid your application having a single humongous repository.
Resolution of the changes. I bet that if you create gazillions of versions any version control will complaint. I do not what gazillions mean. You will have to test it.

Query database

A version control working copy will be limited to primary key queries using the "=" operator and sequential scans. This is not enough to make good reports and statistics on any usage pattern I can think off. That why you need to build a data-mart from your application data and you have two ways of doing that:

A batch process: that reads the whole repository history and builds cubes and other views to allow easier querying.
GIT/SVN/CVS triggers: can call programs made by you on file addition, modification, exclusion, branch creation and merging. This could be used to update the database when a change happen.

The batch is easier to implement but takes time to the reports and statistics be synchronized with the activity. You probably will want to go that way in the 1.0 version and in time moving to triggers to get things more dynamic.

Simulating constraints and transactions

GIT, SVN and CVS supports triggers that execute programs when a new version is submitted. Then the relationships and consistency can be checked to accept or not the change.

Alternative Solutions

Since you do not specified the kind of application you want, I will talk about blogs, content portals and online stores. For those kinds of applications I see no much reason to reinvent the wheel and build a custom database. Most of the versioning necessary can be predicted in the database model. A good event-oriented database design will be enough.

For example, a revision in a blog post could be modeled as marking the end date/time of the post and creating a new row for the revised post, increasing the version number and setting the previous version id. The same strategy can be used with sales and catalog of an online store. If you model your application with good logs you does not need version control.

Some developers also do a row level trigger that records everything that has changed on the database. This is a bit harder for an auditor that would need to reconstruct the past from bad designed logs. I personally do not like this way because is very difficult to index this kinds of queries. I prefer to make my whole applications around a good designed and meaningful log.

For example:

History Table
10/10/2010 [new process] process_id=1; name=john
11/10/2010 [change name] process_id=1; old_name=john; new_name=john doe
12/10/2010 [change name] process_id=1; old_name=john doe; new_name=john doe junior

Process Table after 12/10/2010.
proc_id=1 name=john doe junior

That way I can reconstruct almost everything on the past and still have my operational data in a easy-to-use format.

However, this is not close to the usage pattern you want (branching and merging)

Conclusion

The applicability of version control as a database seems to me very powerful on one hand and very limited and dangerous in another. It is very inspiring for auditing and error correction purposes. But my main concern would be scale and reliability.

Upvotes: 1

xnakos

Reputation: 10196

You mentioned Subversion, which is a Centralized Version Control System. But let us focus on Git, because of reasons. Git is a Decentralized Version Control System. A local copy of a Git repository is the same as a remote copy of the repository, if a remote copy exists at all (services such as GitLab and GitHub provide the remote housing and managing of Git projects). With Git you can have version control in an arbitrary directory in your machine. You can do whatever you are accustomed to doing with SVN, and more, in this arbitrary directory.

What I am getting at, is that you could possibly create per user directories/repositories in your server programmatically, and apply version control in these directories/repositories, keeping a separate repository per user (the specifics of the architecture would be decided later, though, depending on the structure of the user's "work"). Your application would be in charge of adding and removing files on behalf of the user (e.g. Biography, My Sample Project, etc.), editing files, committing the changes, presenting a file history, etc., essentially issuing Git commands. Your application would, thus, interface with the Git repository, exploiting the advanced version control that Git provides. Your database would just make sure that the user is linked to the directory/repository that contains their "work".

To provide a critical analogy, the GitLab project is an open source web-based Git repository manager with wiki and issue tracking features. GitLab is written in Ruby and uses PostgreSQL (preferably). It is a typical (as in Code - Database - Data directories and files) multiuser web-based application. Its purpose is to manage Git repositories. These Git repositories are stored in a designated directory in the server. Part of the code is responsible for accessing the Git repositories that the logged-in user is authorized to access (as the owner or as a collaborator). An interesting use case is of a user editing a file online, which will result in a commit in some branch in some repository. Another interesting use case is of a user checking the history of a file. A final interesting use case is of a user reverting a specific commit. All of these actions are performed online, via a web browser.

To provide an interesting real-world use case, Atlas by O'Reilly is an online platform for publishing-related collaboration using GitLab as the backend.

For Java there is JGit, a lightweight, pure Java library implementing the Git version control system. JGit is used by Eclipse for all actions related to managing Git repositories. Maybe you could look into it. It is an extremely active project, supported by many, Google included.

All of the above make sense, if the "work" you refer to is more than some fields in a database table, which the user will fill in and may later change the values of. For instance, it would make sense for structured text, HTML, etc.

If this "work" is not so large-scale, maybe doing something like what is described above is overkill. In that case, you could employ some of the version control concepts in your database design, such as calculating diffs and applying patches (also in reverse, for viewing past versions / rolling back). Your tables should allow for a tree-like structure, to store the diffs, so you could allow for branches. You could have the active version of a file readily available, as well as the active index (what Git calls HEAD), and navigate to another indexed/hashed/tagged version in the file's history by applying all patches sequentially, if moving forward, or applying patches in reverse, and in the reverse chronological order, if moving backwards. If this "work" is really small-scale, you could even ditch the diff concept, and store the whole version of the "work" in the tree-like structure.

Pure fun.

Upvotes: -1