Reputation: 23
I am relatively new to MySQL and since days I try to improve a query which is performed on a big table (>70 Mio. rows). The good news, I got the correct result. Unfortunately, it takes round about 7 minutes for calculation, which is (in my opinion) way to long. I searched for ways to improve queries and came up with a multiple column index of the table, but now I am somehow stuck and don't know how to improve further.
The table consists of 5 columns, all of them are of interest. The columns state, cur, min and max are INT. The sample column stores some names, which are multiple times in the this column. Every row stores the information of one state of one sample. Every sample has millions of different states. The table looks like this:
samples state cur min max
Sample1 1 58 12 110
Sample1 0 8 12 110
Sample2 1 70 150 190
Sample4 2 10 1 20
Sample3 2 80 50 70
Sample6 2 3 1 10
Sample5 0 18 21 90
Sample5 1 22 21 90
.
.
.
Now I want to perform some statistics on the table: How many times is a state with cur between min and max. I want to get this numbers for every sample. In addition, I also want to calculate the relative amount of the states 1, 2 and 3 for every sample.
The resulting table looks like this:
total amount state 0 state 1 state 2 state 0 % state 1 % state 2 %
Sample1 14504366 13199105 961629 343632 91.0009 6.6299 2.3692
Sample2 13873909 12628523 926846 318540 91.0235 6.6805 2.2960
Sample3 10919017 9231997 828767 858253 84.5497 7.5901 7.8602
Sample4 10148540 8604527 768220 775793 84.7859 7.5698 7.6444
Sample5 14130796 12382867 1078724 669205 87.6304 7.6339 4.7358
Sample6 11307051 9947652 871388 488011 87.9774 7.7066 4.3160
I got the result by using the following code:
# state, cur, min and max are INT
# "samples" is Varchar40
# build the index
ALTER TABLE data_table ADD INDEX `index_name` (state, cur, min, max, samples);
# query
SELECT t.samples,
COUNT(t.state) AS "total amount",
amount_0 AS "state 0",
amount_1 AS "state 1",
amount_2 AS "state 2",
amount_0 / COUNT(t.state) * 100 AS "state 0 %",
amount_1 / COUNT(t.state) * 100 AS "state 1 %",
amount_2 / COUNT(t.state) * 100 AS "state 2 %"
FROM data_table t
JOIN
(
SELECT samples, COUNT(state) as amount_0
FROM data_table
WHERE state = 0 AND cur > min + 15 AND cur < max -20
GROUP BY samples
) tmp0 ON tmp0.samples = t.samples
JOIN
(
SELECT samples, COUNT(state) as amount_1
FROM data_table
WHERE state = 1 AND cur > min + 15 AND cur < max -20
GROUP BY samples
) tmp1 ON tmp1.samples = t.samples
JOIN
(
SELECT samples, COUNT(state) as amount_2
FROM data_table
WHERE state = 2 AND cur > min + 15 AND cur < max -20
GROUP BY samples
) tmp2 ON tmp2.samples = t.samples
WHERE cur > min + 15 AND cur < max -20
GROUP BY t.samples;
EXPLAIN returns following output:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t [NULL] index index_name index_name 143 [NULL] 73647812 11.11 Using where; Using index; Using temporary; Using filesort
1 PRIMARY <derived4> [NULL] ref <auto_key0> <auto_key0> 123 db.t.Sample 10 100 [NULL]
1 PRIMARY <derived2> [NULL] ref <auto_key0> <auto_key0> 123 db.t.Sample 10 100 [NULL]
1 PRIMARY <derived3> [NULL] ref <auto_key0> <auto_key0> 123 db.t.Sample 10 100 [NULL]
4 DERIVED data_table [NULL] ref index_name index_name 5 const 7547114 11.11 Using where; Using index; Using temporary; Using filesort
3 DERIVED data_table [NULL] ref index_name index_name 5 const 15150796 11.11 Using where; Using index; Using temporary; Using filesort
2 DERIVED data_table [NULL] ref index_name index_name 5 const 36823906 11.11 Using where; Using index; Using temporary; Using filesort
I think a big problem are the JOIN parts, where probably the indices get lost. A second time consuming step could be - if I unterstand correctly - the usage of WHERE, which will lead to join everything and than removing the rows which are not fitting the case in a second step. This is for sure time consuming and can be avoided by using ON instead of WHERE(?). My problem is, I don't know how to implement the workaround. Therefore I hope you can help me out.
Upvotes: 2
Views: 57
Reputation: 133370
You could avoid some subquery using conditional aggregation
SELECT t.samples,
COUNT(t.state) AS "total amount",
amount_0 AS "state 0",
amount_1 AS "state 1",
amount_2 AS "state 2",
amount_0 / COUNT(t.state) * 100 AS "state 0 %",
amount_1 / COUNT(t.state) * 100 AS "state 1 %",
amount_2 / COUNT(t.state) * 100 AS "state 2 %"
FROM data_table t
INNER JOIN (
SELECT samples
, COUNT(case when state = 0 then 1 else null end) as amount_0
, COUNT(case when state = 1 then 1 else null end) as amount_1
, COUNT(case when state = 2 then 1 else null end) as amount_0
FROM data_table
WHERE cur > min + 15 AND cur < max -20
GROUP BY samples
) tmp ON tmp.samples = t.samples
WHERE cur > min + 15 AND cur < max -20
GROUP BY t.samples;
Upvotes: 1