paulophoenix
paulophoenix

Reputation: 89

Efficient MySQL query for huge set of data

Say i have a table like below:

CREATE TABLE `hadoop_apps` (
  `clusterId` smallint(5) unsigned NOT NULL,
  `appId` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
  `user` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
  `queue` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
  `appName` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `submitTime` datetime NOT NULL COMMENT 'App submission time',
  `finishTime` datetime DEFAULT NULL COMMENT 'App completion time',
  `elapsedTime` int(11) DEFAULT NULL COMMENT 'App duration in milliseconds',
  PRIMARY KEY (`clusterId`,`appId`,`submitTime`),
  KEY `hadoop_apps_ibk_finish` (`finishTime`),
  KEY `hadoop_apps_ibk_queueCluster` (`queue`,`clusterId`),
  KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

mysql> SELECT COUNT(*) FROM hadoop_apps;

This would return me a count 158593816

So I am trying to understand what is inefficient about the below query and how I can improve it.

mysql> SELECT * FROM hadoop_apps WHERE DATE(finishTime)='10-11-2013';

Also, what's the difference between these two queries?

mysql> SELECT * FROM hadoop_apps WHERE user='foobar';
mysql> SELECT * FROM hadoop_apps HAVING user='foobar';

Upvotes: 1

Views: 74

Answers (2)

Rick James
Rick James

Reputation: 142560

If WHERE works, it is preferred over HAVING. The former is done earlier in the processing, thereby cutting down on the amount of data to shovel through. OK, in your one example, there may be no difference between them.

I cringe whenever I see a DATETIME in a UNIQUE key (your PK). Can't the app have two rows in the same second? Is that a risk you want to take.

Even changing to DATETIME(6) (microseconds) could be risky.

Regardless of what you do in that area, I recommend this pattern for testing:

WHERE finishTime >= '2013-10-11'
  AND finishTime  < '2013-10-11' + INTERVAL 1 DAY

It works "correctly" for DATE, DATETIME, and DATETIME(6), etc. Other flavors add an extra midnight or miss parts of a second. And it avoids hassles with leapdays, etc, if the interval is more than a single day.

KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)

is bad. It won't get past user(8). And prefixing like that is often useless. Let's see the query that tempted you to build that key; we'll come up with a better one.

158M rows with 4 varchars. And they sound like values that don't have many distinct values? Build lookup tables and replace them with SMALLINT UNSIGNED (2 bytes, 0..64K range) or other small id. This will significantly shrink the table, thereby making it faster.

Upvotes: 1

Bill Karwin
Bill Karwin

Reputation: 563011

WHERE DATE(finishTime)='10-11-2013';

This is a problem for the optimizer because anytime you put a column into a function like this, the optimizer doesn't know if the order of values returned by the function will be the same as the order of values input to the function. So it can't use an index to speed up lookups.

To solve this, refrain from putting the column inside a function call like that, if you want the lookup against that column to use an index.

Also, you should use MySQL standard date format: YYYY-MM-DD.

WHERE finishTime BETWEEN '2013-10-11 00:00:00' AND '2013-10-11 23:59:59'

What is the difference between [conditions in WHERE and HAVING clauses]?

The WHERE clause is for filtering rows.

The HAVING clause is for filtering results after applying GROUP BY.

See SQL - having VS where

Upvotes: 3

Related Questions