Reputation: 8999
I have about 30 MB of textual data that is core to the algorithms I use in my web application.
On the one hand, the data is part of the algorithm and changes to the data can cause an entire algorithm to fail. This is why I keep the data in text files in my source control, and all changes are auto-tested (pre-commit). I currently have a good level of control. Distributing the data along with the source as we spawn more web instances is a non-issue because it tags along with the source. I currently have these problems:
On the other hand, it is data, and it "belongs" in a database. I wish I could place it in a database, but then I would have these problems:
Things I have considered thus-far:
BTW: I have a "regular" database as well, with things that are not algorithmic-data.
BTW2: I added the Python tag because I currently use Python, Django, Apache and Nginx, Linux (and some lame developers use Windows).
Thanks in advance!
UPDATE
Some examples of the data (the algorithms are natural language processing stuff):
The list goes on and on, but Imagine trying to parse the sentence Romantic hotel for 2 in Rome arriving in Italy next monday
if someone changes the coordinates that teach me that Rome is in Italy, or if someone adds `romantic' as an alternate name for Las Vegas (yes, the example is lame, but I hope you get the drift).
Upvotes: 4
Views: 143
Reputation: 1121554
I'd term this a resource, which is data that your application relies on, but not the data your application manages. Images, CSS, and templates are similar resources, and you keep them all version controlled.
In this case, you could split out your data into a separate package. In python distribution terms, use a separate egg that your application depends on; package deployment tools such as pip and buildout will pull in the dependency automatically. That way you can version it independently, you can have other tools depend on it, etc.
Last, choose a format that can be managed effectively by a source control system. That means a textual format. You can initiate parse that format on start-up, but by keeping it textual you can manage it properly through changes to it. This could be a SQL dump (CREATE TABLE and INSERT statements) to be loaded into a sqlite database on start-up, or some other python-based structure. You can also choose to load data on demand, caching it in-memory as needed.
Have a look at the pytz timezone database for a great example of how another resource project manages such structures.
Upvotes: 1
Reputation: 363547
Okay, here's an idea:
Alternatively, this route may be easier, esp. when upgrading an installation:
In either case, you can keep your current testing code; you just need to add tests that make sure the database properly overrides text files.
Upvotes: 1