cespinoza

Reputation: 1457

Using a temporary database as an intermediate store in a pipeline?

I have a bioinformatics analysis program composed of 5 different steps. Each step is essentially a Perl script that takes input, does its magic, and outputs several text files. Each step needs to be completely finished before the next one starts. The entire process takes 24 hours or so on Core i7 computers.

One major problem is that each step produces about 5-10 gigabytes of intermediate output text files needed by subsequent steps, and there's a lot of redundancy. For example, the output of step 1 is used by steps 2, 3, and 4, and each of them does the same preprocessing to it. This structure grew 'organically' because each step was developed independently. Doing everything in memory unfortunately will not work for us, since data that is 10 gigs on disk is way too big to fit into memory once loaded into a Perl hash/array.

It would be nice if the data could be loaded into an intermediate database, processed once in one step, and then be available to all subsequent steps. The data is essentially relational/tabular. Some of the steps only need sequential access to the data, while others need random access to files.
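A minimal sketch of that idea, assuming SQLite as the intermediate store and Python's built-in `sqlite3` module rather than Perl. The table name `reads`, its columns, and the three helper functions are hypothetical and only illustrate the "preprocess once, read many times" pattern, with one sequential reader and one random-access reader:

```python
import os
import sqlite3
import tempfile


def load_step_output(db_path, rows):
    """Step 1: preprocess once, then store the result in a shared table."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS reads "
            "(read_id TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)"
        )
        # executemany accepts any iterable, so rows can be streamed
        # from a file instead of being held in memory all at once.
        conn.executemany("INSERT INTO reads VALUES (?, ?, ?)", rows)
    conn.close()


def scan_sequential(db_path):
    """A later step that only needs to walk the data in order."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT read_id, chrom, pos FROM reads ORDER BY read_id"
        )
        for row in cursor:  # rows are fetched lazily, not all at once
            yield row
    finally:
        conn.close()


def lookup(db_path, read_id):
    """A later step that needs random access by key (uses the primary key index)."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT chrom, pos FROM reads WHERE read_id = ?", (read_id,)
        ).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    # Throwaway database for a single run; deleted with the temp directory.
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "pipeline.db")
        load_step_output(db, [("r2", "chr1", 5), ("r1", "chr2", 9)])
        for row in scan_sequential(db):
            print(row)
        print(lookup(db, "r2"))
```

Because the database is a single file, discarding it after a run is just a file delete, which addresses part of the create/destroy overhead concern.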

Does anyone have any experience in this sort of thing?

Which database would be right for such a task? I have used and liked SQLite, but does it scale to 20GB+ sizes? Can you tell PostgreSQL or MySQL to cache data heavily in memory? (I figure that databases written in C/C++ would be much more memory-efficient than Perl hashes/arrays, so most of the data could be cached in memory on a 24GB machine.) Or is there a better, non-RDBMS solution, given the overhead of creating, indexing, and subsequently destroying 20GB+ in an RDBMS for single-run analyses?

Upvotes: 1

Views: 214

Answers (1)

Mikos

Reputation: 8553

Have you looked at some of the NoSQL databases? They seem suited to your kind of work. I have used MongoDB for a high-throughput application.

Here is a comparison of various NoSQL databases.

Upvotes: 1
