Hank
Hank

Reputation: 3497

Optimize NOT IN subquery

I'm using a MySQL database and trying to optimize the following query:

SELECT * FROM main 
WHERE (name IS NOT NULL AND name != '') 
AND user_id NOT IN
( SELECT user_id FROM process
WHERE sns_id = 1  AND (process.status=1 OR process.status=2) 
AND process.updated_at > 1392335789207) LIMIT 100;

Essentially I want to make sure all those user_id in the process table are not included when selecting rows from main.

I've tried optimizing using EXISTS but that doesn't seem to produce any better performance.

How could I optimize this query?

Upvotes: 1

Views: 228

Answers (2)

Justin
Justin

Reputation: 9724

You can use Query:

SELECT *
FROM main
WHERE name IS NOT NULL
 AND name != ''
 AND user_id NOT EXISTS
    (SELECT user_id
     FROM process
     WHERE sns_id = 1
       AND process.status in (1,2)
       AND process.updated_at > 1392335789207
       AND process.user_id = main.user_id) 
LIMIT 100

You saying that there no duplicate user id, so you can try use this query which will be probably the fastest solution:

SELECT m.*
FROM main m
LEFT JOIN process p
 ON p.user_id = m.user_id
 AND p.sns_id = 1
 AND p.status in (1,2)
 AND p.updated_at > 1392335789207
WHERE m.name IS NOT NULL
 AND m.name != ''
 AND p.user_id is null
LIMIT 100

Upvotes: 2

spencer7593
spencer7593

Reputation: 108510

I believe that with a NOT IN (subquery), that subquery gets evaluated for each and every row returned by the outer query. If you have 500,000 rows to process through, that's 500,000 executions of the same subquery. An index on process is almost mandatory for performance.

A (usually) much more efficient approach for large sets is to use an anti-join pattern.

To do that, we do an OUTER JOIN to find matching rows from process, and then we discard any rows that found a match. What we are left with is rows from main that didn't have a matching row. Equivalent to the NOT IN (subquery), but (usually) much more efficient with large sets.

SELECT m.*
  FROM main m
  LEFT
  JOIN ( SELECT p.user_id
           FROM process p
          WHERE p.sns_id = 1
            AND p.status IN (1,2)
            AND p.updated_at > 1392335789207
          GROUP BY p.user_id
       ) r
    ON r.user_id = m.user_id
 WHERE (m.name IS NOT NULL AND m.name != '')
   AND r.user_id IS NULL
 LIMIT 100;

Note that we use the LEFT keyword on the join operation to specify we want all rows from main to be returned, even if there is no matching row from r. And we include r.user_id IS NULL as a predicate in the WHERE clause, to exclude rows that had a match from r. (The equality comparator in the JOIN predicate (r.user_id = m.user_id) guarantees us that any matching rows from r will have a user_id value that IS NOT NULL. So any NULL values of r.user_id must be the result of not finding a match from r.

For performance, you may want at covering index available on the process table, for example.

ON process ('user_id, `updated_at`, `status`, `sns_id`)

The GROUP BY in the inline view query r isn't strictly necessary, but we don't really need multiple rows returned for the same user_id, one is enough. With an index with a leading column of user_id, MySQL will often use an index operation to perform the GROUP BY, avoiding a Using filesort operation. We also don't need to return NULL values for user_id.

Upvotes: 0

Related Questions