FBB

Reputation: 1475

Alternative to database design with enumerated columns, leading to poor performance

Following remarks on my previous question, I describe here the problem that led me to a database schema with enumerated columns and poor performance.

(see edit at the bottom of this question for my overall conclusion)

I work with gene expression data. We capture the conditions in which a gene is expressed (for instance, that gene X is expressed in the condition [organ Y - life stage Z]). I have 4 dataTypes that can produce such expression data, so my raw data are stored in different tables, for instance (this is just an illustrative example; the real raw data are far more complex):

+--------------------+------------------------------------+------+-----+--------------+-------+
| Field              | Type                               | Null | Key | Default      | Extra |
+--------------------+------------------------------------+------+-----+--------------+-------+
| geneId             | int(10) unsigned                   | NO   | PRI | NULL         |       |
| evidenceId         | varchar(70)                        | NO   | PRI | NULL         |       |
| experimentId       | varchar(70)                        | NO   | MUL | NULL         |       |
| conditionId        | mediumint(8) unsigned              | NO   | MUL | NULL         |       |
| expressionId       | int(10) unsigned                   | NO   | MUL | NULL         |       |
| detectionFlag      | enum('expressed', 'not expressed') | NO   |     | NULL         |       |
| quality            | enum('low quality','high quality') | NO   |     | NULL         |       |
+--------------------+------------------------------------+------+-----+--------------+-------+

I have one such table per dataType. Now, typical queries will request thousands of genes at the same time. Because the data are very large (several hundred million rows in each table) and include redundant values (lots of evidence for the same gene, lots of genes for the same evidence), it is very slow to query each table individually. For this reason, we have a precomputed "summary" table, computed from the information in these 4 tables:

+----------------+-----------------------+------+-----+---------+----------------+
| Field          | Type                  | Null | Key | Default | Extra          |
+----------------+-----------------------+------+-----+---------+----------------+
| expressionId   | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId         | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId    | mediumint(8) unsigned | NO   | MUL | NULL    |                |
+----------------+-----------------------+------+-----+---------+----------------+

(Note that there are other useful columns in this table.) The expressionId field makes it possible to get back to the raw data.
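
For illustration, getting back to the raw data for a given summary row is just a join on expressionId; the table names used below (summary and dataType1Table) are placeholders, not the real ones:

-- Sketch only: "summary" and "dataType1Table" are placeholder names.
SELECT r.evidenceId, r.experimentId, r.detectionFlag, r.quality
FROM summary s
JOIN dataType1Table r ON r.expressionId = s.expressionId
WHERE s.geneId = ? AND s.conditionId = ?;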

Now my problem is that queries need to retrieve expression data for thousands of genes at once, filtering on any combination of dataTypes and on a minimum number of supporting experiments. So I ended up with the following design:

+--------------------------+-----------------------+------+-----+---------+----------------+
| Field                    | Type                  | Null | Key | Default | Extra          |
+--------------------------+-----------------------+------+-----+---------+----------------+
| expressionId             | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId                   | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId              | mediumint(8) unsigned | NO   | MUL | NULL    |                |
| dataType1ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType2ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType3ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType4ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
+--------------------------+-----------------------+------+-----+---------+----------------+

Rows in this table are precomputed by taking into account all dataTypes and all related conditions of a given conditionId. This is very slow to compute, and the resulting table has hundreds of millions of rows.
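
To give an idea of the kind of aggregation involved, here is a simplified sketch (one such statement per dataType; the raw table name and the detectionFlag filter are assumptions for the example, and the real computation additionally propagates data to related conditions, which is what makes it slow):

-- Sketch only: dataType1Table is a placeholder, and the detectionFlag filter is an assumption.
INSERT INTO expression (geneId, conditionId, dataType1ExperimentCount)
SELECT geneId, conditionId, COUNT(DISTINCT experimentId)
FROM dataType1Table
WHERE detectionFlag = 'expressed'
GROUP BY geneId, conditionId
ON DUPLICATE KEY UPDATE dataType1ExperimentCount = VALUES(dataType1ExperimentCount);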

Now my queries look like:

SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount + dataType3ExperimentCount + dataType4ExperimentCount) >= ?;
SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount) >= ?;

Performance is very bad because, according to the answers to my previous question, such queries can't use indexes. I need to allow any combination of dataTypes, and I need to allow the addition of new dataTypes in the future (so the number of combinations will quickly reach 32 or 64).

What better design could I come up with?

Edit, following a request from user Rick James, here is the SHOW CREATE TABLE output:

CREATE TABLE `expression` (
  `expressionId` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `geneId` mediumint(8) unsigned NOT NULL,
  `conditionId` mediumint(8) unsigned NOT NULL,
  `dataType1ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType2ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType3ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType4ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`expressionId`),
  UNIQUE KEY `geneId` (`geneId`,`conditionId`),
  KEY `conditionId` (`conditionId`),
  CONSTRAINT `expression_ibfk_1` FOREIGN KEY (`geneId`) REFERENCES `gene` (`geneId`) ON DELETE CASCADE,
  CONSTRAINT `expression_ibfk_2` FOREIGN KEY (`conditionId`) REFERENCES `cond` (`conditionId`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

(And, yes, a given geneId has fewer rows in the table than a given conditionId, so the composite unique key is properly ordered.)

Edit, overall conclusion:

Upvotes: 2

Views: 111

Answers (2)

Rick James

Reputation: 142453

Instead of

PRIMARY KEY (`expressionId`),
UNIQUE KEY `geneId` (`geneId`,`conditionId`),

use

PRIMARY KEY(`geneId`,`conditionId`),
INDEX (`expressionId`),

If no other tables are referencing expressionId, get rid of that column and the index on it.

Why does this help? The data is clustered on the PRIMARY KEY; you are looking up data by geneId, which is the start of the PK, so the rows for a given gene are stored together and can be fetched more efficiently, especially if the table is a lot bigger than innodb_buffer_pool_size (which should be about 70% of RAM).
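
For reference, one possible way to apply that change in a single statement (untested sketch; rebuilding a table with hundreds of millions of rows will take a while):

ALTER TABLE expression
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (geneId, conditionId),  -- clusters the data by geneId
  ADD INDEX (expressionId),               -- keeps the AUTO_INCREMENT column indexed
  DROP INDEX geneId;                      -- the old UNIQUE (geneId, conditionId) is now redundant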

Upvotes: 1

Łukasz Kamiński

Reputation: 5940

From previous post:

Maybe try a column-store engine for MySQL, like ICE or InfiniDB? You wouldn't need indexes, because they store data in a way similar to how indexes work in row-based storage. This type of storage is faster for certain use cases and slower for others; data warehouses, aggregates, analytic queries and the like should benefit.

There are community versions as well as paid enterprise editions.

Upvotes: 1
