vdegenne
vdegenne

Reputation: 13270

IN vs NOT EXISTS clause, query optimization

Here are the simplified tables for the example :

my tables

And this is my query :

SELECT * FROM article A

WHERE A.id IN (

   SELECT AT.article_id FROM article A1, article_tag AT

   WHERE (A1.id = AT.article_id)

   AND (AT.tag_id IN (2,1))

   GROUP BY AT.article_id

   HAVING count(AT.article_id) = 2

);

Technically this query does seem to work and returns "all the articles having at least keywords 2 and 1".

The bold parts are meant to change. For instance if my keywords list is [1, 3, 4],

(2, 1) will change to (1, 3, 4) and 2 will change to 3 (the list's length).

Though this query does work, i slightly remember about a friend using NOT EXISTS clause. is it applicable ? if yes, which query is the best optimization in term of performance ?

Upvotes: 2

Views: 1734

Answers (3)

Frank Schmitt
Frank Schmitt

Reputation: 30765

You don't need EXISTS or IN - just GROUP BY all the columns from article, and you're done. Example (in Oracle syntax, since you didn't mention your RDBMS):

with article(id, title) as (
    select 1, 'MS SQL Server' from dual union all
    select 2, 'Oracle' from dual union all
    select 3, 'PostgreSQL' from dual union all
    select 4, 'IDBM DB2' from dual),
  article_tag(id, article_id, tag_id) as (
    select 1,1,1 from dual union all
    select 2,1,2 from dual union all
    select 3,2,1 from dual union all
    select 4,3,2 from dual
  ) 
  SELECT a.id, a.title 
  FROM article a
  JOIN article_tag at ON a.id = at.article_id
       AND at.tag_id IN (2,1)
  GROUP BY a.id, a.title
  HAVING COUNT(at.article_id) = 2;

Upvotes: 2

Luaan
Luaan

Reputation: 63722

One problem I can see with this is that the DB will not be able to reuse the query plan, because for different parameters, you're changing the query text. This will often result in sub-optimal query plans.

So rather than considering in vs. exists, consider passing the options as a parameter. You didn't specify what DB engine you're working with, so I can't say anything specific. On MS SQL for example, you can use a table parameter or xml parameter to do this, helping the query planner do its job.

There's also a few tricks you could use to make the query more straightforward, but that's the kind of thing you should only do with concrete statistics on how the query is used and how it performs.

Be wary of anecdotal evidence - a lot of the arguments against using in (subquery) and similar are years old, and no longer need to apply at all to your scenario. Measure. Prepare realistic test data. Measure. Guessing hurts :)

Upvotes: 0

Peter
Peter

Reputation: 27934

if yes, which query is the best optimization in term of performance?

When optimizing sql your only friend are query plans and switch on statistics. The results often depends on the data in your tables. In query plans you can see what sql-server does on certain tables and with statistics you can see raw timings and reads.

When looking at your query you can make a INNER JOIN, EXISTS, IN, TABLE VALUED FUNCTION, INLINE TABLE VALUED FUNCTION etc. In many cases sql server will optimize them to the same query plan. However in some case it does not. Start by making the queries and see what indexes are needed, in many cases the indexes are more important than how you write the query.

The key point in optimizing sql queries is using real data and real parameters. Then measure, measure, measure and analyze the IO/Reads/Timings etc.

Upvotes: 1

Related Questions