Henk Denneboom
Henk Denneboom

Reputation: 1583

Optimizing MySQL Query, takes almost 20 seconds!

I'm running the following query on a Macbook Pro 2.53ghz with 4GB of Ram:

SELECT
    c.id            AS id,
    c.name          AS name,
    c.parent_id     AS parent_id,
    s.domain        AS domain_name,
    s.domain_id     AS domain_id,
    NULL            AS stats
FROM
    stats s
LEFT JOIN stats_id_category sic ON s.id = sic.stats_id
LEFT JOIN categories c ON c.id = sic.category_id
GROUP BY
    c.name

It takes about 17 seconds to complete.

EXPLAIN:

alt text http://img7.imageshack.us/img7/1364/picture1va.png

The tables:

Information:

Number of rows: 147397
Data size: 20.3MB
Index size: 1.4MB

Table:

CREATE TABLE `stats` (
    `id` int(11) unsigned NOT NULL auto_increment,
    `time` int(11) NOT NULL,
    `domain` varchar(40) NOT NULL,
    `ip` varchar(20) NOT NULL,
    `user_agent` varchar(255) NOT NULL,
    `domain_id` int(11) NOT NULL,
    `date` timestamp NOT NULL default CURRENT_TIMESTAMP,
    `referrer` varchar(400) default NULL,
    KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=147398 DEFAULT CHARSET=utf8

Information second table:

Number of rows: 1285093
Data size: 11MB
Index size: 17.5MB

Second table:

CREATE TABLE `stats_id_category` (
    `stats_id` int(11) NOT NULL,
    `category_id` int(11) NOT NULL,
    KEY `stats_id` (`stats_id`,`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

Information third table:

Number of rows: 161
Data size: 3.9KB
Index size: 8KB

Third table:

CREATE TABLE `categories` (
    `id` int(11) NOT NULL auto_increment,
    `parent_id` int(11) default NULL,
    `name` varchar(40) NOT NULL,
    `questions_category_id` int(11) NOT NULL default '0',
    `rank` int(2) NOT NULL default '0',
    PRIMARY KEY  (`id`),    
    KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=205 DEFAULT CHARSET=latin1

Hopefully someone can help me speed this up.

Upvotes: 1

Views: 1262

Answers (4)

Bill Karwin
Bill Karwin

Reputation: 562871

I see several WTF's in your query:

  1. You use two LEFT OUTER JOINs but then you group by the c.name column which might have no matches. So perhaps you don't really need an outer join? If that's the case, you should use an inner join, because outer joins are often slower.

  2. You are grouping by c.name but this gives ambiguous results for every other column in your select-list. I.e. there might be multiple values in these columns in each grouping by c.name. You're lucky you're using MySQL, because this query would simply give an error in any other RDBMS.

    This is a performance issue because the GROUP BY is likely causing the "using temporary; using filesort" you see in the EXPLAIN. This is a notorious performance-killer, and it's probably the single biggest reason this query is taking 17 seconds. Since it's not clear why you're using GROUP BY at all (using no aggregate functions, and violating the Single-Value Rule), it seems like you need to rethink this.

  3. You are grouping by c.name which doesn't have a UNIQUE constraint on it. You could in theory have multiple categories with the same name, and these would be lumped together in a group. I wonder why you don't group by c.id if you want one group per category.

  4. SELECT NULL AS stats: I don't understand why you need this. It's kind of like creating a variable that you never use. It shouldn't harm performance, but it's just another WTF that makes me think you haven't thought this query through very well.

  5. You say in a comment you're looking for number of visitors per category. But your query doesn't have any aggregate functions like SUM() or COUNT(). And your select-list includes s.domain and s.domain_id which would be different for every visitor, right? So what value do you expect to be in the result set if you only have one row per category? This isn't really a performance issue either, it just means your query results don't tell you anything useful.

  6. Your stats_id_category table has an index over its two columns, but no primary key. So you can easily get duplicate rows, and this means your count of visitors may be inaccurate. You need to drop that redundant index and use a primary key instead. I'd order category_id first in that primary key, so the join can take advantage of the index.

    ALTER TABLE stats_id_category DROP KEY stats_id, 
      ADD PRIMARY KEY (category_id, stats_id);
    

Now you can eliminate one of your joins, if all you need to count is the number of visitors:

SELECT c.id, c.name, c.parent_id, COUNT(*) AS num_visitors
FROM categories c
INNER JOIN stats_id_category sic ON (sic.category_id = c.id)
GROUP BY c.id;

Now the query doesn't need to read the stats table at all, or even the stats_id_category table. It can get its count simply by reading the index of the stats_id_category table, which should eliminate a lot of work.

Upvotes: 2

Kibbee
Kibbee

Reputation: 66152

I agree with Bill. Point 2 is very important. The query doesn't even make logical sense. Also, with the simple fact that there is no where statement means that you have to pull back every row in the stats table, which seems to be around 140000. It then has to sort all that data, so that it can perform the GROUP BY. This is because sort [ O(n log n)] and then find duplicates [ O(n) ] is much faster than just finding duplicates without sorting the data set [ O(n^2)?? ].

Upvotes: 0

dj_segfault
dj_segfault

Reputation: 12449

Harrison is right; we need the other table. I would start by adding an index on category_id to stats_id_category, though.

Upvotes: 0

Harrison Fisk
Harrison Fisk

Reputation: 7144

You are missing the third table in the information provided (categories).

Also, it seems odd that you are doing a LEFT JOIN and then using the right table (which might be all NULLS) in the GROUP BY. You will end up grouping all of the non-matching rows together as a result, is that what you intended?

Finally, can you provide an EXPLAIN for the SELECT?

Upvotes: 0

Related Questions