FBB

Reputation: 1475

Alternative to database design with enumerated columns, leading to poor performance

Following remarks on my previous question, I describe here the problem that led me to a database schema with enumerated columns and poor performance.

(see edit at the bottom of this question for my overall conclusion)

I work with gene expression data. We capture the conditions in which a gene is expressed (for instance, that gene X is expressed in the condition [organ Y - life stage Z]). I have 4 dataTypes that can produce such expression data, so my raw data are stored in different tables, for instance (this is just an illustrative example; the real raw data are far more complex):

+--------------------+------------------------------------+------+-----+--------------+-------+
| Field              | Type                               | Null | Key | Default      | Extra |
+--------------------+------------------------------------+------+-----+--------------+-------+
| geneId             | int(10) unsigned                   | NO   | PRI | NULL         |       |
| evidenceId         | varchar(70)                        | NO   | PRI | NULL         |       |
| experimentId       | varchar(70)                        | NO   | MUL | NULL         |       |
| conditionId        | mediumint(8) unsigned              | NO   | MUL | NULL         |       |
| expressionId       | int(10) unsigned                   | NO   | MUL | NULL         |       |
| detectionFlag      | enum('expressed', 'not expressed') | NO   |     | NULL         |       |
| quality            | enum('low quality','high quality') | NO   |     | NULL         |       |
+--------------------+------------------------------------+------+-----+--------------+-------+

I have one such table per dataType. Now, typical queries will request thousands of genes at the same time. Because the data are very large (several hundred million rows in each table) and include redundant values (lots of evidence for the same gene, lots of genes for the same evidence), it is very slow to query each table individually. For this reason, we have a precomputed "summary" table, computed from the information in these 4 tables:

+----------------+-----------------------+------+-----+---------+----------------+
| Field          | Type                  | Null | Key | Default | Extra          |
+----------------+-----------------------+------+-----+---------+----------------+
| expressionId   | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId         | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId    | mediumint(8) unsigned | NO   | MUL | NULL    |                |
+----------------+-----------------------+------+-----+---------+----------------+

(Note that there are other useful columns in this table.) The expressionId field makes it possible to get back to the raw data.
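
For illustration, getting back to the raw data for a given summary row is just a join on expressionId; the table names used below (summary and dataType1Table) are placeholders, not the real ones:

-- Sketch only: "summary" and "dataType1Table" are placeholder names.
SELECT r.evidenceId, r.experimentId, r.detectionFlag, r.quality
FROM summary s
JOIN dataType1Table r ON r.expressionId = s.expressionId
WHERE s.geneId = ? AND s.conditionId = ?;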

Now my problem is that queries need to retrieve expression data for thousands of genes at once, filtering on any combination of dataTypes and on a minimum number of supporting experiments. So I ended up with the following design:

+--------------------------+-----------------------+------+-----+---------+----------------+
| Field                    | Type                  | Null | Key | Default | Extra          |
+--------------------------+-----------------------+------+-----+---------+----------------+
| expressionId             | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId                   | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId              | mediumint(8) unsigned | NO   | MUL | NULL    |                |
| dataType1ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType2ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType3ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType4ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
+--------------------------+-----------------------+------+-----+---------+----------------+

Rows in this table are precomputed by taking into account all dataTypes and all related conditions of a given conditionId. This is very slow to compute, and the resulting table has hundreds of millions of rows.
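
To give an idea of the kind of aggregation involved, here is a simplified sketch (one such statement per dataType; the raw table name and the detectionFlag filter are assumptions for the example, and the real computation additionally propagates data to related conditions, which is what makes it slow):

-- Sketch only: dataType1Table is a placeholder, and the detectionFlag filter is an assumption.
INSERT INTO expression (geneId, conditionId, dataType1ExperimentCount)
SELECT geneId, conditionId, COUNT(DISTINCT experimentId)
FROM dataType1Table
WHERE detectionFlag = 'expressed'
GROUP BY geneId, conditionId
ON DUPLICATE KEY UPDATE dataType1ExperimentCount = VALUES(dataType1ExperimentCount);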

Now my queries look like:

SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount + dataType3ExperimentCount + dataType4ExperimentCount) >= ?;
SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount) >= ?;

Performance is very bad because, according to the answers to my previous question, such queries can't use indexes. I need to allow any combination of dataTypes, and I need to allow the addition of new dataTypes in the future (so the number of combinations will quickly reach 32 or 64).

What better design could I come up with?

Edit, following a request from user Rick James, here is the SHOW CREATE TABLE output:

CREATE TABLE `expression` (
  `expressionId` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `geneId` mediumint(8) unsigned NOT NULL,
  `conditionId` mediumint(8) unsigned NOT NULL,
  `dataType1ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType2ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType3ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType4ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`expressionId`),
  UNIQUE KEY `geneId` (`geneId`,`conditionId`),
  KEY `conditionId` (`conditionId`),
  CONSTRAINT `expression_ibfk_1` FOREIGN KEY (`geneId`) REFERENCES `gene` (`geneId`) ON DELETE CASCADE,
  CONSTRAINT `expression_ibfk_2` FOREIGN KEY (`conditionId`) REFERENCES `cond` (`conditionId`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

(And, yes, a given geneId has fewer rows in the table than a given conditionId, so the composite unique key is properly ordered.)

Edit, overall conclusion:

Upvotes: 2

Views: 111

Answers (2)

Rick James

Reputation: 142453

Instead of

PRIMARY KEY (`expressionId`),
UNIQUE KEY `geneId` (`geneId`,`conditionId`),

use

PRIMARY KEY(`geneId`,`conditionId`),
INDEX (`expressionId`),

If no other tables are referencing expressionId, get rid of that column and the index on it.

Why does this help? The data is clustered on the PRIMARY KEY; you are looking up data by geneId, which is the start of the PK, so the rows for a given gene are stored together and can be fetched more efficiently, especially if the table is a lot bigger than innodb_buffer_pool_size (which should be about 70% of RAM).
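
For reference, one possible way to apply that change in a single statement (untested sketch; rebuilding a table with hundreds of millions of rows will take a while):

ALTER TABLE expression
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (geneId, conditionId),  -- clusters the data by geneId
  ADD INDEX (expressionId),               -- keeps the AUTO_INCREMENT column indexed
  DROP INDEX geneId;                      -- the old UNIQUE (geneId, conditionId) is now redundant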

Upvotes: 1

Łukasz Kamiński

Reputation: 5940

From previous post:

Maybe try a column-store engine for MySQL, like ICE or InfiniDB? You wouldn't need indexes, because they store data in a way similar to how indexes work in row-based storage. This type of storage is faster for certain use cases and slower for others; data warehouses, aggregates, analytic queries and the like should benefit.

There are community versions as well as paid enterprise editions.

Upvotes: 1
