Ben
Ben

Reputation: 512

Optimising MySQL query on JOINed tables with GROUP BY and ORDER BY without using nested queries

This feels like a bit of a beginner SQL question to me, but here goes. This is what I'm trying to do:

I've described the tables involved at the bottom of this post, but here's the query that I'm performing

   SELECT p.*, GROUP_CONCAT(pt.name) 
     FROM products p
LEFT JOIN product_tags_for_products pt4p ON (pt4p.product_id = p.id)
LEFT JOIN product_tags pt ON (pt.id = pt4p.product_tag_id)
 GROUP BY p.id
 ORDER BY p.created 
    LIMIT 30;

There are about 280,000 products, 130 tags, 524,000 linking records and I've ANALYZEd the tables. The problem is that it's taking over 80s to run (on decent hardware), which feels wrong to me.

Here's the EXPLAIN results:

id   select_type    table    type    possible_keys                    key                              key_len    ref                   rows  Extra
1    SIMPLE         p        index   NULL                             created                          4          NULL                  30    "Using temporary"
1    SIMPLE         pt4p     ref     idx_product_tags_for_products    idx_product_tags_for_products    3          s.id                  1     "Using index"
1    SIMPLE         pt       eq_ref  PRIMARY                          PRIMARY                          4          pt4p.product_tag_id   1    

I think it's doing things in the wrong order, i.e. ORDERing the results after the join, using a large temporary table, and then LIMITing. The query plan in my head would go something like this:

This sounds simple, but it doesn't seem to work like that - am I missing something?


CREATE TABLE `products` (
  `id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `rating` float NOT NULL,
  `created` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `last_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `active` tinyint(1) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `created` (`created`),
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

CREATE TABLE `product_tags_for_products` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `product_id` mediumint(8) unsigned NOT NULL,
  `product_tag_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_product_tags_for_products` (`product_id`,`product_tag_id`),
  KEY `product_tag_id` (`product_tag_id`),
  CONSTRAINT `product_tags_for_products_ibfk_1` FOREIGN KEY (`product_id`) REFERENCES `products` (`id`),
  CONSTRAINT `product_tags_for_products_ibfk_2` FOREIGN KEY (`product_tag_id`) REFERENCES `product_tags` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


CREATE TABLE `product_tags` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Updated with profiling information at Salman A's request:

Status,  
  Duration,CPU_user,CPU_system,Context_voluntary,Context_involuntary,Block_ops_in,Block_ops_out,Messages_sent,Messages_received,Page_faults_major,Page_faults_minor,Swaps,Source_function,Source_file,Source_line
starting,              
  0.000124,0.000106,0.000015,0,0,0,0,0,0,0,0,0,NULL,NULL,NULL
"Opening tables",      
  0.000022,0.000020,0.000003,0,0,0,0,0,0,0,0,0,"unknown function",sql_base.cc,4519
"System lock",   
  0.000007,0.000004,0.000002,0,0,0,0,0,0,0,0,0,"unknown function",lock.cc,258
"Table lock",   
  0.000011,0.000009,0.000002,0,0,0,0,0,0,0,0,0,"unknown function",lock.cc,269
init,           
  0.000055,0.000054,0.000001,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,2524
optimizing,       
  0.000008,0.000006,0.000002,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,833
statistics,      
  0.000116,0.000051,0.000066,0,0,0,0,0,0,0,1,0,"unknown function",sql_select.cc,1024
preparing,       
  0.000027,0.000023,0.000003,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,1046
"Creating tmp table",
  0.000054,0.000053,0.000002,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,1546
"Sorting for group", 
  0.000018,0.000015,0.000003,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,1596
executing,       
  0.000004,0.000002,0.000001,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,1780
"Copying to tmp table", 
  0.061716,0.049455,0.013560,0,18,0,0,0,0,0,3680,0,"unknown function",sql_select.cc,1927
"converting HEAP to MyISAM",
  0.046731,0.006371,0.017543,3,5,0,3,0,0,0,32,0,"unknown function",sql_select.cc,10980
"Copying to tmp table on disk", 
 10.700166,3.038211,1.191086,538,1230,1,31,0,0,0,65,0,"unknown function",sql_select.cc,11045
"Sorting result", 
  0.777887,0.155327,0.618896,2,137,0,1,0,0,0,634,0,"unknown function",sql_select.cc,2201
"Sending data", 
  0.000336,0.000159,0.000178,0,0,0,0,0,0,0,1,0,"unknown function",sql_select.cc,2334
end, 
  0.000005,0.000003,0.000002,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,2570
"removing tmp table", 
  0.106382,0.000058,0.080105,4,9,0,11,0,0,0,0,0,"unknown function",sql_select.cc,10912
end, 
  0.000015,0.000007,0.000007,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,10937
"query end", 
  0.000004,0.000002,0.000001,0,0,0,0,0,0,0,0,0,"unknown function",sql_parse.cc,5083
"freeing items", 
  0.000012,0.000012,0.000001,0,0,0,0,0,0,0,0,0,"unknown function",sql_parse.cc,6107
"removing tmp table", 
  0.000010,0.000009,0.000001,0,0,0,0,0,0,0,0,0,"unknown function",sql_select.cc,10912
"freeing items", 
  0.000084,0.000022,0.000057,0,1,0,0,1,0,0,0,0,"unknown function",sql_select.cc,10937
"logging slow query", 
  0.000004,0.000001,0.000001,0,0,0,0,0,0,0,0,0,"unknown function",sql_parse.cc,1723
"logging slow query", 
  0.000049,0.000031,0.000018,0,0,0,0,0,0,0,0,0,"unknown function",sql_parse.cc,1733
"cleaning up", 
  0.000007,0.000005,0.000002,0,0,0,0,0,0,0,0,0,"unknown function",sql_parse.cc,1691

The tables are:

Products = 84.1MiB (there are extra fields in the products table which I omitted for clarity) Tags = 32KiB Linking table = 46.6MiB

Upvotes: 5

Views: 376

Answers (3)

Ilmari Karonen
Ilmari Karonen

Reputation: 50328

Regarding filtering on tags (which you mentioned in the comments on Johan's answer), if the obvious

SELECT p.*, GROUP_CONCAT(pt.name) AS tags
FROM products p
  JOIN product_tags_for_products pt4p2 ON (pt4p2.product_id = p.id)
  JOIN product_tags pt2 ON (pt2.id = pt4p2.product_tag_id)
  LEFT JOIN product_tags_for_products pt4p ON (pt4p.product_id = p.id)
  LEFT JOIN product_tags pt ON (pt.id = pt4p.product_tag_id)
WHERE pt2.name IN ('some', 'tags', 'here')
GROUP BY p.id
ORDER BY p.created LIMIT 30

doesn't run fast enough, you could always try this:

CREATE TEMPORARY TABLE products30
  SELECT p.*
  FROM products p
    JOIN product_tags_for_products pt4p ON (pt4p.product_id = p.id)
    JOIN product_tags pt ON (pt.id = pt4p.product_tag_id)
  WHERE pt.name IN ('some', 'tags', 'here')
  GROUP BY p.id
  ORDER BY p.created LIMIT 30

SELECT p.*, GROUP_CONCAT(pt.name) AS tags
FROM products30 p
  LEFT JOIN product_tags_for_products pt4p ON (pt4p.product_id = p.id)
  LEFT JOIN product_tags pt ON (pt.id = pt4p.product_tag_id)
GROUP BY p.id
ORDER BY p.created

(I used a temp table because you said "no subqueries"; I don't know if they're any easier to use in an Active Record framework, but at least it's another way to do it.)


Ps. One really off-the-wall idea about your original problem: would it make any difference if you changed the GROUP BY p.id clause to GROUP BY p.created, p.id? Probably not, but I'd at least try it.

Upvotes: 0

Alex Kremer
Alex Kremer

Reputation: 1

Ah - I see that none of the keys you GROUP BY on are BTREE, by default PRIMARY keys are hashes. It helps group by when there is an ordering index... otherwise it has to scan...

What I mean is, I think it would help significantly if you added a BTREE based index for p.id and p.created. In that case I think the engine will avoid having to scan/sort all those keys to execute group by and order by.

Upvotes: 0

Johan
Johan

Reputation: 76557

I would try limiting the number of products to 30 first and then joining with only 30 products:

   SELECT p.*, GROUP_CONCAT(pt.name) as tags
     FROM (SELECT p30.* FROM products p30 ORDER BY p30.created LIMIT 30) p 
LEFT JOIN product_tags_for_products pt4p ON (pt4p.product_id = p.id) 
LEFT JOIN product_tags pt ON (pt.id = pt4p.product_tag_id) 
 GROUP BY p.id 
 ORDER BY p.created  

I know you said no subqueries but you did not explain why, and I don't see any other way to solve your issue.

Note that you can eliminate the subselect by putting that in a view:

CREATE VIEW v_last30products AS 
  SELECT p30.* FROM products p30 ORDER BY p30.created LIMIT 30; 

Then the query is simplified to:

   SELECT p.*, GROUP_CONCAT(pt.name) as tags
     FROM v_last30products p 
LEFT JOIN product_tags_for_products pt4p ON (pt4p.product_id = p.id) 
LEFT JOIN product_tags pt ON (pt.id = pt4p.product_tag_id) 
 GROUP BY p.id 
 ORDER BY p.created  

Other issue, your n-to-n table product_tags_for_products

Does not make sense, I'd restructure it like so:

CREATE TABLE `product_tags_for_products` (    
  `product_id` mediumint(8) unsigned NOT NULL,    
  `product_tag_id` int(10) unsigned NOT NULL,    
  PRIMARY KEY (`product_id`,`product_tag_id`),       
  CONSTRAINT `product_tags_for_products_ibfk_1` FOREIGN KEY (`product_id`) REFERENCES `products` (`id`),    
  CONSTRAINT `product_tags_for_products_ibfk_2` FOREIGN KEY (`product_tag_id`) REFERENCES `product_tags` (`id`)    
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci  

This should make the query faster by:
- shortening the key used (On InnoDB the PK is always included in secondary keys);
- Allowing you to use the PK which should be faster than using a secondary key;

More speed issues
If you replace the select * with only the fields that you need select p.title, p.rating, ... FROM that will also speed up things a little.

Upvotes: 3

Related Questions