Reputation: 201
I'm working in a biology lab and I have to design a database to store the results of many DNA microarray experiments.
Each experiment is composed of many microarrays (about ten on average), and each microarray contains over 5 million probes. Each probe is mapped to a particular gene id; of course, the same probe maps to the same gene_id in all experiments. The aim is to store the intensity values of each microarray so that I can quickly retrieve the intensity values of the probes of a particular gene id in a particular experiment.
In fact, a simple MySQL table would be enough. It would look like this:
intensity table: | probe_id | experiment_id | microarray_id | gene_id | intensity_value |
with a primary key composed of (probe_id, experiment_id, microarray_id, gene_id).
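Roughly, the DDL would be something like this (the column types, the extra index, and the literal ids in the query are just illustrative, nothing is fixed yet):

    CREATE TABLE intensity (
        probe_id        INT UNSIGNED NOT NULL,
        experiment_id   INT UNSIGNED NOT NULL,
        microarray_id   INT UNSIGNED NOT NULL,
        gene_id         INT UNSIGNED NOT NULL,
        intensity_value FLOAT        NOT NULL,
        PRIMARY KEY (probe_id, experiment_id, microarray_id, gene_id),
        -- assumed secondary index to support the gene-per-experiment lookup
        KEY idx_gene_experiment (gene_id, experiment_id)
    ) ENGINE=InnoDB;

    -- the lookup that has to be fast (42 and 7 are placeholder ids):
    SELECT probe_id, microarray_id, intensity_value
    FROM intensity
    WHERE gene_id = 42 AND experiment_id = 7;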
Here's the problem: each experiment has many microarrays, each of which has over 5 million probes. With 1000 experiments and 10 microarrays on average (which is a low estimate; some have hundreds), that's 1000 * 10 * 5M = 50 billion rows. I guess that would be slow, and I have absolutely no idea how to handle a MySQL table with billions of rows. Is that possible? Any tips?
I'm also curious about NoSQL databases. I have never used Cassandra, but it seems to me that it would be perfect for this task. Am I right? I can imagine a schema like this:
{
    experiment_id_1: {                                                    <- that's a super column family?
        gene_id_1: {                                                      <- that's a column family?
            probe_id_1: { value_microarray_1, value_microarray_2, ... },  <- that's a super column?
            probe_id_2: { value_microarray_1, value_microarray_2, ... },
            probe_id_3: { value_microarray_1, value_microarray_2, ... },
            ...
        },
        gene_id_2: {
            probe_id_1: { value_microarray_1, value_microarray_2, ... },
            probe_id_2: { value_microarray_1, value_microarray_2, ... },
            probe_id_3: { value_microarray_1, value_microarray_2, ... },
            ...
        }
    },
    experiment_id_2: {
        ...
    },
    ...
}
Am I right? Would it fit the Cassandra model? Would it be efficient? What do you think, NoSQL gurus? :)
Thanks.
Upvotes: 5
Views: 1070
Reputation:
I am familiar with the model because this was one of the proposals for my PhD back in 2010. However, I was looking at ~80,000 experiments (which is only a fraction of the Gene Expression Omnibus database).
You have a typical n:m cardinality. If your requirement covers microarrays of the same kind, three tables are needed: probe (the list of probes provided by your microarray manufacturer), assay (one experiment involving one microarray), and measure (key of probe, key of assay, raw value, normalized value, p-value, etc.).
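A rough MySQL sketch of such a layout (the table and column names here are illustrative, not a fixed recommendation):

    CREATE TABLE probe (
        probe_id INT UNSIGNED PRIMARY KEY,
        gene_id  INT UNSIGNED NOT NULL,
        KEY idx_gene (gene_id)
    );

    CREATE TABLE assay (
        assay_id      INT UNSIGNED PRIMARY KEY,
        experiment_id INT UNSIGNED NOT NULL
    );

    -- one row per (probe, assay) pair
    CREATE TABLE measure (
        probe_id         INT UNSIGNED NOT NULL,
        assay_id         INT UNSIGNED NOT NULL,
        raw_value        FLOAT,
        normalized_value FLOAT,
        p_value          FLOAT,
        PRIMARY KEY (probe_id, assay_id)
    );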
NoSQL will work if your queries go either by experiment or by probe. That means getting all the values for a probe (across all experiments) is mutually exclusive with getting all the probes of an experiment; if you want to go with any NoSQL tool (Cassandra included), you will need to store the data twice.
On the other hand, the volume is not so large as to prevent storing it on a single node. For example, 80k experiments x 5M probes x 100 bytes per record = 40 TB, which you can cover with a RAID 6 array of 12 x 6 TB disks = 60 TB usable (2 disks subtracted for redundancy).
Since the odds are that you will need either a whole experiment or all the values of a probe at once, another option is to go with flat files: the first set is your input data, the second set you have to compile by querying your database. To avoid long disk seeks, storing the data in two physical structures may be the only way to succeed.
Upvotes: 0
Reputation: 6932
MySQL or Postgres might work out fine for you, and the other answers have given you some good tips on how to do that. But, since you specifically asked about Cassandra as well, here are my thoughts:
Cassandra would work well for this. If you want to be able to efficiently look up all intensity values for an experiment/gene_id combination, I suggest something slightly different from what you've come up with: use a composite row key like (<experiment_id>, <gene_id>) (or just a string like "<experiment_id>:<gene_id>" if you want to keep it simple), and use one column per intensity value in that row. This will let you fetch all of the intensity values you need very efficiently, typically with one or two disk seeks for a cold lookup.
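In current Cassandra versions that layout can be written down in CQL; here is a sketch under that assumption (table and column names are mine, not part of the suggestion above):

    -- the composite partition key groups all values for one
    -- experiment/gene combination so they can be read in one lookup
    CREATE TABLE intensity_by_experiment_gene (
        experiment_id text,
        gene_id       text,
        probe_id      text,
        microarray_id text,
        intensity     double,
        PRIMARY KEY ((experiment_id, gene_id), probe_id, microarray_id)
    );

    -- all intensity values for one gene in one experiment:
    SELECT probe_id, microarray_id, intensity
      FROM intensity_by_experiment_gene
     WHERE experiment_id = 'exp1' AND gene_id = 'geneA';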
Upvotes: 1
Reputation: 5381
For this use case I would also consider a relational database over a NoSQL one. With a few up-front considerations you can check whether you will be able to handle the data.
When I have to deal with this kind of situation, I usually generate test data similar to what I expect to have in the real table and play around with different server parameters. I would also consider partitioning the table, for example by experiment_id. This splits the table into smaller subsets that can be handled within the limits of the existing hardware. Don't try to do this splitting by hand: MySQL can do it for you, the table is still presented to the user as a single table, and the machine only has to deal with the partition where the rows for a given experiment_id are stored. This results in much faster I/O, etc.
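A minimal sketch of such partitioning in MySQL, assuming the intensity table from the question (hash partitioning and the partition count are just one possible choice):

    CREATE TABLE intensity (
        probe_id        INT UNSIGNED NOT NULL,
        experiment_id   INT UNSIGNED NOT NULL,
        microarray_id   INT UNSIGNED NOT NULL,
        gene_id         INT UNSIGNED NOT NULL,
        intensity_value FLOAT        NOT NULL,
        PRIMARY KEY (probe_id, experiment_id, microarray_id, gene_id)
    )
    PARTITION BY HASH (experiment_id)
    PARTITIONS 64;

Queries that filter on experiment_id then only have to touch the single partition holding that experiment (partition pruning).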
I have seen machines that easily handle tables with far more rows than you expect, but such setups have to be planned carefully, and it usually takes a lot of testing/optimizing/redesigning before you can go into production. It is always worth the effort, though, since it is quite an interesting problem to work on.
(I got my first experience in this field working with EMBL data during my studies, and it became my passion ;))
Upvotes: 2
Reputation: 10637
An RDBMS should not choke on that volume at all. Your data is structured enough that it makes sense to put it into relations.
MySQL, depending on your storage engine, can handle this. I might recommend table partitioning, which puts the data into separate underlying tables, purely from a storage-management perspective.
Related: How many rows in a database are TOO MANY?
Upvotes: 1
Reputation: 7307
Consider this:
Have a table for each experiment, with columns (probe_id, gene_id, array_of_values). If I understood you correctly, the primary key would be on probe_id (but if you don't query by that column, you could simply skip the primary key). You also need an index on gene_id.
That gives you 1000 tables of a manageable 5M rows each. Good or not? Does this fit your query patterns? A neat property of this scheme is that it makes it easy to drop old data.
By the way, if you consider PostgreSQL instead of MySQL, it has native array types; otherwise you would have to figure out an efficient way of serializing the arrays.
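A sketch of one such per-experiment table in PostgreSQL, using its native array type (the table name and ids are placeholders):

    -- one row per probe; the array holds the values of all
    -- microarrays of this experiment, in a fixed order
    CREATE TABLE experiment_42 (
        probe_id         integer PRIMARY KEY,
        gene_id          integer NOT NULL,
        intensity_values real[]  NOT NULL
    );

    CREATE INDEX experiment_42_gene_idx ON experiment_42 (gene_id);

    -- all probe values for one gene in this experiment:
    SELECT probe_id, intensity_values FROM experiment_42 WHERE gene_id = 12345;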
Anyway, this should be easy to test.
Upvotes: 1
Reputation: 189816
Maybe I'm missing something, but the system you describe sounds quite homogeneous and well structured.
That makes it a good candidate for a relational database (MySQL or PostgreSQL) rather than a NoSQL database; NoSQL databases are much better at handling heterogeneous data.
Upvotes: 0