Efficient MySQL query for huge set of data

Question

Say I have a table like the one below:

CREATE TABLE `hadoop_apps` (
  `clusterId` smallint(5) unsigned NOT NULL,
  `appId` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
  `user` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
  `queue` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
  `appName` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `submitTime` datetime NOT NULL COMMENT 'App submission time',
  `finishTime` datetime DEFAULT NULL COMMENT 'App completion time',
  `elapsedTime` int(11) DEFAULT NULL COMMENT 'App duration in milliseconds',
  PRIMARY KEY (`clusterId`,`appId`,`submitTime`),
  KEY `hadoop_apps_ibk_finish` (`finishTime`),
  KEY `hadoop_apps_ibk_queueCluster` (`queue`,`clusterId`),
  KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

mysql> SELECT COUNT(*) FROM hadoop_apps;

This returns a count of 158593816.

So I am trying to understand what is inefficient about the query below and how I can improve it.

mysql> SELECT * FROM hadoop_apps WHERE DATE(finishTime)='10-11-2013';

Also, what's the difference between these two queries?

mysql> SELECT * FROM hadoop_apps WHERE user='foobar';
mysql> SELECT * FROM hadoop_apps HAVING user='foobar';

Bill Karwin · Accepted Answer

WHERE DATE(finishTime)='10-11-2013';

This is a problem for the optimizer because whenever you wrap a column in a function like this, the optimizer cannot assume that the values returned by the function sort in the same order as the values passed into it. So it can't use the index on finishTime to speed up the lookup.

To solve this, keep the column out of the function call and compare it directly against constant values if you want the lookup against that column to use an index.

Also, you should use MySQL's standard date format, YYYY-MM-DD.

WHERE finishTime BETWEEN '2013-10-11 00:00:00' AND '2013-10-11 23:59:59'
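As a variation on the range above, a half-open range avoids having to spell out the last second of the day, and EXPLAIN can be used to check whether the index on finishTime is actually chosen. This is only a sketch: it assumes the intended date is 11 October 2013, and the plan you get will depend on your MySQL version and data distribution.

```sql
-- Half-open range: everything from midnight on the 11th up to,
-- but not including, midnight on the 12th.
SELECT *
FROM hadoop_apps
WHERE finishTime >= '2013-10-11'
  AND finishTime <  '2013-10-12';

-- Compare the plans. The first form wraps the column in DATE(),
-- the second compares the bare column against constants.
EXPLAIN SELECT * FROM hadoop_apps
WHERE DATE(finishTime) = '2013-10-11';

EXPLAIN SELECT * FROM hadoop_apps
WHERE finishTime >= '2013-10-11'
  AND finishTime <  '2013-10-12';
```

In the EXPLAIN output, look at the type, key and rows columns. One would typically expect the second query to list hadoop_apps_ibk_finish as the key with a range access type, but verify this against your own server rather than taking it as given.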

What is the difference between [conditions in WHERE and HAVING clauses]?

The WHERE clause is for filtering rows.

The HAVING clause is for filtering results after applying GROUP BY.
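To make the distinction concrete, here is a small sketch against the table from the question. The threshold of 100 and the date are arbitrary values chosen for illustration.

```sql
-- WHERE removes rows before they are grouped;
-- HAVING removes whole groups after the aggregates are computed.
SELECT `user`, COUNT(*) AS apps
FROM hadoop_apps
WHERE finishTime >= '2013-10-11'
  AND finishTime <  '2013-10-12'   -- row filter, applied before grouping
GROUP BY `user`
HAVING COUNT(*) > 100;             -- group filter, applied after grouping
```

MySQL also accepts HAVING without GROUP BY, as in the second query from the question. As far as I know, HAVING is applied near the end of query execution, after the rows have already been read, so that form would generally not benefit from the index on user the way the WHERE form can. Running EXPLAIN on both queries is the way to confirm this on your server.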

See SQL - having VS where
