Reputation: 1475
Following remarks on my previous question, I describe here the problem that led me to a database schema with enumerated columns, resulting in poor performance.
(see edit at the bottom of this question for my overall conclusion)
I work with gene expression data. We capture in which conditions any gene is expressed (for instance, to say that gene X is expressed in the condition [organ Y - life stage Z]). I have 4 dataTypes that can produce such expression data, so my raw data are stored in different tables, for instance (this is just an illustrative example; the raw data are far more complex):
+--------------------+------------------------------------+------+-----+--------------+-------+
| Field              | Type                               | Null | Key | Default      | Extra |
+--------------------+------------------------------------+------+-----+--------------+-------+
| geneId             | int(10) unsigned                   | NO   | PRI | NULL         |       |
| evidenceId         | varchar(70)                        | NO   | PRI | NULL         |       |
| experimentId       | varchar(70)                        | NO   | MUL | NULL         |       |
| conditionId        | mediumint(8) unsigned              | NO   | MUL | NULL         |       |
| expressionId       | int(10) unsigned                   | NO   | MUL | NULL         |       |
| detectionFlag      | enum('expressed', 'not expressed') | NO   |     | NULL         |       |
| quality            | enum('low quality','high quality') | NO   |     | NULL         |       |
+--------------------+------------------------------------+------+-----+--------------+-------+
I have one such table per dataType. Now, typical queries will request thousands of genes at the same time. Because the data are very large (several hundred million rows in each table) and include redundant values (tons of evidence for the same gene, tons of genes for the same evidence), it is very slow to query each table individually. For this reason, we have a precomputed "summary" table, computed from the information in these 4 tables:
+----------------+-----------------------+------+-----+---------+----------------+
| Field          | Type                  | Null | Key | Default | Extra          |
+----------------+-----------------------+------+-----+---------+----------------+
| expressionId   | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId         | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId    | mediumint(8) unsigned | NO   | MUL | NULL    |                |
+----------------+-----------------------+------+-----+---------+----------------+
(Note that there are other useful columns in this table.) The expressionId field allows getting back to the raw data.
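As a minimal sketch of that link (using SQLite in place of MySQL, with invented table names and toy data), a join on expressionId recovers every raw evidence row behind a summary row:

```python
import sqlite3

# Toy schema: one raw dataType table and the summary table, linked by
# expressionId. Table names and data are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE rawDataType1 (
    geneId INTEGER NOT NULL,
    evidenceId TEXT NOT NULL,
    expressionId INTEGER NOT NULL,
    detectionFlag TEXT NOT NULL,
    PRIMARY KEY (geneId, evidenceId)
);
CREATE TABLE expressionSummary (
    expressionId INTEGER PRIMARY KEY,
    geneId INTEGER NOT NULL,
    conditionId INTEGER NOT NULL
);
INSERT INTO rawDataType1 VALUES
    (1, 'ev1', 10, 'expressed'),
    (1, 'ev2', 10, 'expressed');
INSERT INTO expressionSummary VALUES (10, 1, 100);
""")

# One summary row fans out to all raw evidence rows supporting it.
rows = con.execute("""
    SELECT r.evidenceId, r.detectionFlag
    FROM expressionSummary s
    JOIN rawDataType1 r ON r.expressionId = s.expressionId
    WHERE s.geneId = 1 AND s.conditionId = 100
    ORDER BY r.evidenceId
""").fetchall()
print(rows)
```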
Now my problem is:

1. I need to compute a summaryQuality, based on the number of distinct experiments supporting an expression row, in the condition itself, but also taking into account any related condition (I leave out what a related condition is, but, yes, relations between conditions could be stored in another table).
2. Users should be able to parameterize this summaryQuality by summing the experiments supporting an expression row over any combination of dataTypes. E.g., they should be able to say "give me results supported by x experiments from the sum of experiments in dataType1 and dataType2", or "give me results supported by y experiments from the sum of experiments in dataType1 and dataType2 and dataType3 and dataType4".

So I ended up with the following design:
+--------------------------+-----------------------+------+-----+---------+----------------+
| Field                    | Type                  | Null | Key | Default | Extra          |
+--------------------------+-----------------------+------+-----+---------+----------------+
| expressionId             | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| geneId                   | int(10) unsigned      | NO   | MUL | NULL    |                |
| conditionId              | mediumint(8) unsigned | NO   | MUL | NULL    |                |
| dataType1ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType2ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType3ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
| dataType4ExperimentCount | smallint(5) unsigned  | NO   |     | 0       |                |
+--------------------------+-----------------------+------+-----+---------+----------------+
Rows in this table are precomputed by taking into account all dataTypes and all related conditions of a given conditionId. This is very slow to compute, and the table ends up with hundreds of millions of rows.
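To make that precomputation concrete, here is an illustrative sketch (SQLite; a single invented rawEvidence table with a dataType column stands in for the four separate raw tables, and an invented condRelation table holds the condition relations): for one gene and condition, it counts distinct experiments per dataType over the condition and its related conditions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE rawEvidence (
    geneId INTEGER, conditionId INTEGER,
    experimentId TEXT, dataType INTEGER
);
-- each condition is related to itself plus any related conditions
CREATE TABLE condRelation (
    conditionId INTEGER, relatedConditionId INTEGER
);
INSERT INTO rawEvidence VALUES
    (1, 100, 'exp1', 1), (1, 100, 'exp2', 1), (1, 101, 'exp3', 2);
INSERT INTO condRelation VALUES (100, 100), (100, 101), (101, 101);
""")

# For gene 1 in condition 100: count distinct experiments per dataType,
# pulling in evidence from condition 100 and its related condition 101.
rows = con.execute("""
    SELECT r.dataType, COUNT(DISTINCT r.experimentId)
    FROM condRelation c
    JOIN rawEvidence r ON r.conditionId = c.relatedConditionId
    WHERE r.geneId = 1 AND c.conditionId = 100
    GROUP BY r.dataType ORDER BY r.dataType
""").fetchall()
print(rows)
```

Each resulting (dataType, count) pair becomes one of the dataTypeNExperimentCount values in the summary row; doing this for every (geneId, conditionId) pair is what makes the precomputation slow.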
Now my queries look like:
SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount + dataType3ExperimentCount + dataType4ExperimentCount) >= ?;
SELECT * FROM myTable WHERE geneId IN (?, ?, ?, ...) AND (dataType1ExperimentCount + dataType2ExperimentCount) >= ?;
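Since the set of summed columns varies per query, the SQL has to be assembled dynamically. A sketch (Python, with SQLite standing in for MySQL; the table and column names follow the design above) that builds the predicate for any combination of dataTypes from a fixed whitelist:

```python
import sqlite3

# Whitelist of summable columns, matching the table design above.
ALL_COUNT_COLS = ["dataType1ExperimentCount", "dataType2ExperimentCount",
                  "dataType3ExperimentCount", "dataType4ExperimentCount"]

def build_query(count_cols, n_genes):
    # Column names come from the fixed whitelist, never from user input,
    # so interpolating them into the SQL text is safe here.
    assert all(c in ALL_COUNT_COLS for c in count_cols)
    gene_ph = ", ".join("?" for _ in range(n_genes))
    summed = " + ".join(count_cols)
    return (f"SELECT * FROM myTable WHERE geneId IN ({gene_ph}) "
            f"AND ({summed}) >= ?")

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE myTable (
    geneId INTEGER, conditionId INTEGER,
    dataType1ExperimentCount INTEGER, dataType2ExperimentCount INTEGER,
    dataType3ExperimentCount INTEGER, dataType4ExperimentCount INTEGER)""")
con.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?, ?, ?)",
                [(1, 100, 3, 0, 1, 0), (2, 100, 0, 1, 0, 0)])

# "sum of dataType1 and dataType2 experiments >= 2" for genes 1 and 2
sql = build_query(["dataType1ExperimentCount", "dataType2ExperimentCount"], 2)
rows = con.execute(sql, (1, 2, 2)).fetchall()
print(rows)
```

Only gene 1 qualifies here (3 + 0 >= 2); gene 2 does not (0 + 1 < 2). The computed sum in the WHERE clause is exactly what prevents index use on the count columns.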
The performance is very bad, because such queries can't use indexes, according to the answers to my previous question. I need to allow any combination of dataTypes, and to allow the addition of new dataTypes in the future (so the number of combinations will quickly reach 32 or 64).
What better design could I come up with?
Edit: following the request of user Rick James, here is the SHOW CREATE TABLE output:
CREATE TABLE `expression` (
  `expressionId` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `geneId` mediumint(8) unsigned NOT NULL,
  `conditionId` mediumint(8) unsigned NOT NULL,
  `dataType1ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType2ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType3ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  `dataType4ExperimentCount` smallint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`expressionId`),
  UNIQUE KEY `geneId` (`geneId`,`conditionId`),
  KEY `conditionId` (`conditionId`),
  CONSTRAINT `expression_ibfk_1` FOREIGN KEY (`geneId`) REFERENCES `gene` (`geneId`) ON DELETE CASCADE,
  CONSTRAINT `expression_ibfk_2` FOREIGN KEY (`conditionId`) REFERENCES `cond` (`conditionId`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
(And, yes, a given geneId has fewer rows in the table than a given conditionId, so the composite unique key is properly ordered.)
Edit, overall conclusion:
Upvotes: 2
Views: 111
Reputation: 142453
Instead of
PRIMARY KEY (`expressionId`),
UNIQUE KEY `geneId` (`geneId`,`conditionId`),
use
PRIMARY KEY(`geneId`,`conditionId`),
INDEX (`expressionId`),
If no other tables are referencing expressionId, get rid of that column and the index on it.
Why does this help? InnoDB clusters the data by the PRIMARY KEY. You are looking up data by geneId, which is the start of that PK, so the rows you need can be fetched more efficiently, especially if the table is a lot bigger than innodb_buffer_pool_size (which should be set to about 70% of RAM).
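The reshaped keys can be sketched as follows (SQLite, whose WITHOUT ROWID tables are clustered by their PRIMARY KEY, loosely analogous to InnoDB's clustered index; this only illustrates the layout, not the performance claim):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Clustered by (geneId, conditionId): all rows for one geneId are
-- stored contiguously, which is what this answer relies on.
CREATE TABLE expression (
    geneId INTEGER NOT NULL,
    conditionId INTEGER NOT NULL,
    expressionId INTEGER NOT NULL,
    dataType1ExperimentCount INTEGER NOT NULL DEFAULT 0,
    dataType2ExperimentCount INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (geneId, conditionId)
) WITHOUT ROWID;
-- Secondary index, kept only if something still references expressionId.
CREATE INDEX expression_expressionId ON expression(expressionId);
""")
con.executemany("INSERT INTO expression VALUES (?, ?, ?, ?, ?)",
                [(1, 100, 10, 3, 0), (1, 101, 11, 0, 2), (2, 100, 12, 1, 1)])

# A geneId lookup now reads one contiguous slice of the clustered key
# instead of chasing expressionId values scattered across the table.
rows = con.execute(
    "SELECT conditionId FROM expression WHERE geneId = ? ORDER BY conditionId",
    (1,),
).fetchall()
print(rows)
```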
Upvotes: 1
Reputation: 5940
From my previous post: maybe try a column-store engine for MySQL, like ICE or InfiniDB? You wouldn't need indexes, because a column store lays data out in a way that serves much the same purpose as indexes do in row-based storage. This type of storage is faster for certain use cases and slower for others; data warehouses, aggregates, analytic queries and the like should benefit.
There are community versions as well as paid enterprise editions.
Upvotes: 1