Slow Query - Help with Optimization

Question

Hey guys. This is a follow-on from this question:

After getting the right data and making some tweaks based on requests from business, I've now got this mini-beast on my hands. This query should return the total number of new jobseeker registrations and the number of new uploaded CV's:

SELECT COUNT(j.jobseeker_id) as new_registrations,
(
    SELECT 
      COUNT(c.cv_id)
    FROM 
      tb_cv as c, tb_jobseeker, tb_industry
    WHERE
      UNIX_TIMESTAMP(c.created_at) >= '1241125200'
    AND 
      UNIX_TIMESTAMP(c.created_at) <= '1243717200'
    AND 
      tb_jobseeker.industry_id = tb_industry.industry_id
) 
AS uploaded_cvs
FROM 
  tb_jobseeker as j, tb_industry as i
WHERE
  j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND
  i.industry_id = j.industry_id
GROUP BY i.description, MONTH(j.created_at)

Notes: - The two values in the UNIX TIMESTAMP functions are passed in as parameters from the report module in our backend.

Every time I run it, MySQL chokes and lingers silently into the ether of the Interweb.

Help is appreciated.

Update: Hey guys. Thanks a lot for all the thoughtful and helpful comments. I'm only 2 weeks into my role here, so I'm still learning the schema. So, this query is somewhere between a thumbsuck and an educated guess. Will start to answer all your questions now.

Tomalak · Accepted Answer

tb_cv is not connected to the other tables in the sub-query. I guess this is the root cause for the slow query. It causes generation of a Cartesian product, yielding a lot more rows than you probably need.

Other than that I'd say you need indexes on tb_jobseeker.created_at, tb_cv.created_at and tb_industry.industry_id, and you might want to get rid of the UNIX_TIMESTAMP() calls in the sub-query since they prevent use of an index. Use BETWEEN and the actual field values instead.

Here is my attempt at understanding your query and writing a better version. I guess you want to get the count of new jobseeker registrations and new uploaded CVs per month per industry:

SELECT 
  i.industry_id,
  i.description, 
  MONTH(j.created_at)            AS month_created,
  YEAR(j.created_at)             AS year_created,
  COUNT(DISTINCT j.jobseeker_id) AS new_registrations,
  COUNT(cv.cv_id)                AS uploaded_cvs
FROM 
  tb_cv AS cv
  INNER JOIN tb_jobseeker AS j ON j.jobseeker_id = cv.jobseeker_id
  INNER JOIN tb_industry  AS i ON i.industry_id  = j.industry_id
WHERE
  j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
  AND cv.created_at BETWEEN '2009-05-01' AND '2009-05-31'
GROUP BY 
  i.industry_id,
  i.description, 
  MONTH(j.created_at),
  YEAR(j.created_at)

A few things I noticed while writing the query:

you GROUP BY values you don't output in the end. Why? (I've added the grouped field to the output list.)
you JOIN three tables in the sub-query while only ever using values from one of them. Why? I don't see what it would be good for, other than filtering out CV records that don't have a jobseeker or an industry attached — which I find hard to imagine. (I've removed the entire sub-query and used a simple COUNT instead.)
Your sub-query returns the same value every time. Did you maybe mean to correlate it in some way, to the industry maybe?.
The sub-query runs once for every record in a grouped query without being wrapped in an aggregate function.

Slow Query - Help with Optimization

Answers (2)

Related Questions