Reputation: 29968
geneHomology
============
id genome_name gene_id homolog_genome_name homolog_gene_id consider_homolog
1 HomoSap 1007 MusMus 824 1
2 HomoSap 1007 MusMus 825 1
3 HomoSap 1007 MusMus 826 1
4 HomoSap 2890 EColi 2140 1
...
gene
====
genome_name gene_id gene_category
MusMus 823 Upregulated
MusMus 824 Downregulated
MusMus 825 Normal
MusMus 826 Normal
MusMus 827 Upregulated
EColi 2140 Normal
...
consider_homolog
is an enum (0,1). genome_name
and gene_id
are primary keys in gene
. geneHomology
is very large - some 200M rows.
My goal is to count for each gene in genes
how many homologs it has from each gene_category
.
For example, following the data above, HomoSap 1007
has 3 Normal
homologs and 1 Downregulated
.
So my query is:
SELECT a.id,a.genome_name,a.gene_id,a.homolog_genome_name,a.homolog_gene_id,COUNT(b.gene_category)
FROM geneHomology a,gene b
WHERE a.consider_homolog='1' AND a.homolog_genome_name=b.genome_name AND a.homolog_gene_id=b.gene_id
GROUP BY a.genome_name,a.gene_id,b.gene_category;
It never returns (and I've been patiently waiting for more than an hour).
I already indexed gene_category
in gene
.
I'm really new to MySQL but I have root access to the DB so I could follow your suggestions (carefully...). I would be happy to provide any further information.
UPDATE
This is the EXPLAIN
output for the query:
+----+-------------+-------+------+-----------------------+----------------------+---------+----------------------------------------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------+----------------------+---------+----------------------------------------------------------+---------+---------------------------------+
| 1 | SIMPLE | b | ALL | PRIMARY,gene_genome | NULL | NULL | NULL | 1560695 | Using temporary; Using filesort |
| 1 | SIMPLE | a | ref | geneHomologyHit_gene | geneHomologyHit_gene | 54 | my_db_v71.b.gene_id,my_db_v71.b.genome_name | 13 | Using where |
+----+-------------+-------+------+-----------------------+----------------------+---------+----------------------------------------------------------+---------+---------------------------------+
UPDATE 2
mysql> SHOW INDEX FROM gene;
+-------+------------+--------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+--------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+
| gene | 0 | PRIMARY | 1 | gene_id | A | NULL | NULL | NULL | | BTREE | |
| gene | 0 | PRIMARY | 2 | genome_name | A | 1560695 | NULL | NULL | | BTREE | |
| gene | 1 | gene_organism | 1 | taxon_id | A | 392 | NULL | NULL | | BTREE | |
| gene | 1 | gene_genome | 1 | genome_name | A | 853 | NULL | NULL | | BTREE | |
| gene | 1 | gene_gene_category | 1 | gene_category | A | 5 | NULL | NULL | | BTREE | |
+-------+------------+--------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.01 sec)
UPDATE 3
mysql> SHOW INDEX FROM geneHomology;
+--------------+------------+------------------------+--------------+--------------------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+------------------------+--------------+--------------------------+-----------+-------------+----------+--------+------+------------+---------+
| geneHomology | 0 | PRIMARY | 1 | id | A | 680326661 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyQuery_gene | 1 | gene_id | A | 1498516 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyQuery_gene | 2 | genome_name | A | 1505147 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyHit_gene | 1 | homolog_gene_id | A | 52332820 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyHit_gene | 2 | homolog_genome_name | A | 52332820 | NULL | NULL | | BTREE | |
+--------------+------------+------------------------+--------------+--------------------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.00 sec)
UPDATE 4
Is there a way to get only partial results to even see I'm getting what I want? I tried LIMIT 1000
and even LIMIT 10
but it doesn't seem to change anything.
UPDATE 5
mysql> SHOW CREATE TABLE geneHomology;
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| geneHomology | CREATE TABLE `geneHomology` (
`id` bigint(20) NOT NULL auto_increment,
`genome_name` varchar(20) NOT NULL,
`gene_id` varchar(30) NOT NULL,
`homolog_genome_name` varchar(20) NOT NULL,
`homolog_gene_id` varchar(30) NOT NULL,
`homolog_length` bigint(20) unsigned NOT NULL,
`significance` double unsigned NOT NULL,
`bit_score` double unsigned NOT NULL,
`percent_identity` double unsigned NOT NULL,
`start_match` int(10) unsigned NOT NULL,
`end_match` int(10) unsigned NOT NULL,
`start_match_percent` double unsigned NOT NULL,
`end_match_percent` double unsigned NOT NULL,
`strand` enum('+','-') default NULL,
`homolog_start_match` int(10) unsigned NOT NULL,
`homolog_end_match` int(10) unsigned NOT NULL,
`homolog_start_match_percent` double unsigned NOT NULL,
`homolog_end_match_percent` double unsigned NOT NULL,
`homolog_strand` enum('+','-') default NULL,
`consider_gene_homology` enum('0','1') NOT NULL,
`reason_not_considered` varchar(50) default NULL,
`num_hsps` int(10) unsigned NOT NULL,
`homology_type` varchar(2) NOT NULL,
PRIMARY KEY (`id`),
KEY `geneHomologygene` (`gene_id`,`genome_name`),
KEY `geneHomologyhomolog_gene` (`homolog_gene_id`,`homolog_genome_name`)
) ENGINE=MyISAM AUTO_INCREMENT=680326662 DEFAULT CHARSET=latin1 |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW CREATE TABLE gene;
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| gene | CREATE TABLE `gene` (
`taxon_id` int(10) unsigned NOT NULL,
`genome_name` varchar(20) NOT NULL,
`gene_id` varchar(30) NOT NULL,
`symbol` varchar(30) default NULL,
`type` varchar(30) default NULL,
`product` varchar(300) default NULL,
`strand` enum('+','-') NOT NULL,
`start` bigint(20) unsigned NOT NULL,
`end` bigint(20) unsigned NOT NULL,
`gene_category` enum('Upregulated','Downregulated','Normal','n/a') NOT NULL,
`consider_gene` enum('0','1') NOT NULL,
`reason_not_considered` varchar(50) default NULL,
`sequence` longblob NOT NULL,
`additional_info` varchar(300) default NULL,
PRIMARY KEY (`gene_id`,`genome_name`),
KEY `gene_organism` (`taxon_id`),
KEY `gene_genome` (`genome_name`),
KEY `gene_gene_category` (`gene_category`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Upvotes: 3
Views: 900
Reputation: 425251
SELECT a.genome_name, a.gene_id,
cats.gene_category,
(
SELECT COUNT(*)
FROM geneHomology ab
JOIN gene b
ON b.genome_name = ab.homolog_genome_name
AND b.gene_id = ab.homolog_gene_id
WHERE ab.genome_name = a.genome_name
AND ab.gene_id = a.gene_id
AND b.gene_category = cats.gene_category
) cx
FROM gene a
CROSS JOIN
(
SELECT 'Normal' AS gene_category
UNION ALL
SELECT 'Upregulated' AS gene_category
UNION ALL
SELECT 'Downregulated' AS gene_category
) cats
LIMIT 100
This will remove filesort
from your plan.
If you have a table with all possible gene_categories
, replace the cats
with it.
Upvotes: 1
Reputation: 48357
First remove the reference to genome_name from the WHERE part of the query - if both gene.gene_id and gene.genome_name are unique then here is a clear functional dependency here which is confusing the issue somewhat - the number/number join will be marginally mroe efficient that the text/text join.
Looking at the plan it rather implies you've already got an index on geneHomology.hit_gene_id . If this is the case then there's not much scope for making the query go faster without schema changes. However a key length of 54 suggests you've got a lot of stuff in that index which should not be there. Trimming this down to just hit_gene_id and consider_homolog will help a little with the performance, but the limiting factor is that there does not seem any way to avoid the full table scan on gene unless there are other functional dependencies in there.
How quickly does it take to complete a 'SELECT * FROM gene' ? How many records in ene_homology?
It looks like geneHomology decomposes a N:M relationship between gene and gene (itself) and applies labels.
If the number of values in homolog_genome_name is relatively small you might consider decomposing this into gene using bitmap field. Or possibly denormalise the relationship into a set of 1:1 mappings. Alternatively you could enumerate the homolog clusters.
Upvotes: 0
Reputation: 4158
From what you've posted here, I would reccomend a minor bit of denormalisation and put gene_category into geneHomolgy. You can then get rid of the join entirely and you can create an index on your consider_homolog + GROUP BY fields.
Upvotes: 0