Tal Weiss
Tal Weiss

Reputation: 8999

What's the best way to handle source-like data files in a web application?

I have about 30 MB of textual data that is core to the algorithms I use in my web application.

On the one hand, the data is part of the algorithm and changes to the data can cause an entire algorithm to fail. This is why I keep the data in text files in my source control, and all changes are auto-tested (pre-commit). I currently have a good level of control. Distributing the data along with the source as we spawn more web instances is a non-issue because it tags along with the source. I currently have these problems:

On the other hand, it is data, and it "belongs" in a database. I wish I could place it in a database, but then I would have these problems:

Things I have considered thus-far:

BTW: I have a "regular" database as well, with things that are not algorithmic-data.
BTW2: I added the Python tag because I currently use Python, Django, Apache and Nginx, Linux (and some lame developers use Windows).

Thanks in advance!

UPDATE

Some examples of the data (the algorithms are natural language processing stuff):

The list goes on and on, but Imagine trying to parse the sentence Romantic hotel for 2 in Rome arriving in Italy next monday if someone changes the coordinates that teach me that Rome is in Italy, or if someone adds `romantic' as an alternate name for Las Vegas (yes, the example is lame, but I hope you get the drift).

Upvotes: 4

Views: 143

Answers (2)

Martijn Pieters
Martijn Pieters

Reputation: 1121554

I'd term this a resource, which is data that your application relies on, but not the data your application manages. Images, CSS, and templates are similar resources, and you keep them all version controlled.

In this case, you could split out your data into a separate package. In python distribution terms, use a separate egg that your application depends on; package deployment tools such as pip and buildout will pull in the dependency automatically. That way you can version it independently, you can have other tools depend on it, etc.

Last, choose a format that can be managed effectively by a source control system. That means a textual format. You can initiate parse that format on start-up, but by keeping it textual you can manage it properly through changes to it. This could be a SQL dump (CREATE TABLE and INSERT statements) to be loaded into a sqlite database on start-up, or some other python-based structure. You can also choose to load data on demand, caching it in-memory as needed.

Have a look at the pytz timezone database for a great example of how another resource project manages such structures.

Upvotes: 1

Fred Foo
Fred Foo

Reputation: 363547

Okay, here's an idea:

  1. Ship all the data as is done now.
  2. Have the installation script install it in the appropriate databases.
  3. Let users modify this database and give them a button "restore to original" that simply reinstalls from the text file.

Alternatively, this route may be easier, esp. when upgrading an installation:

  1. Ship all the data as is done now.
  2. Let users modify the data and store the modified versions in the appropriate database.
  3. Let the code look in the database, falling back to the text files if the appropriate data is not found. Don't let the code modify the text files in any way.

In either case, you can keep your current testing code; you just need to add tests that make sure the database properly overrides text files.

Upvotes: 1

Related Questions