sproketboy
sproketboy

Reputation: 9468

stackoverflow db schema - multi-column join SLOW with EXISTS

I'm doing some performance testing and the StackOverflow2010 DB is great for this.

Looking here for testing multi-column joins SQL WHERE.. IN clause multiple columns

The StackOverflow DB is available here: https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/

I'm trying to query for Comments for Posts for specific users:

-- get users id < 5
SELECT *
FROM [Users]
WHERE Id < 5;

-- get posts for those users
SELECT *
FROM [Posts]
WHERE [OwnerUserId] IN (SELECT [Id] FROM [Users] WHERE Id < 5);

-- get comments for these posts 
SELECT *
FROM [Comments]
WHERE [PostId] IN (SELECT [Id]
                   FROM [Posts]
                   WHERE [OwnerUserId] IN (SELECT [Id] FROM [Users] WHERE Id < 5))
  AND [UserId] IN (SELECT [OwnerUserId]
                   FROM [Posts]
                   WHERE [OwnerUserId] IN (SELECT [Id] FROM [Users] WHERE Id < 5));

-- THIS version seems to select ALL comments and takes forever!
SELECT *
FROM [Comments]
WHERE EXISTS (SELECT *
              FROM [Posts]
              WHERE [OwnerUserId] IN (SELECT [Id] FROM [Users] WHERE Id < 5));

The IN query works but it selects some Comments that the OwnerUserId didn't make (orphans).

The EXISTS query (which seems to be the correct way) seems to select ALL comments and takes forever!

Any ideas? The reason for this is explained here: https://sproket.github.io/Persism/n+1.html

Upvotes: 0

Views: 70

Answers (1)

Thom A
Thom A

Reputation: 96004

You are massively overly complicating the issue here.

Firstly, let's have a look at your second query. This doesn't actually need to reference the Users table at all. You're looking for details of Posts where the User ID is < 5, and the User ID is already in the Posts table, OwnerUserId. So you could just write your second query as:

SELECT *
FROM dbo.Posts
WHERE OwnerUserId < 5;

For your next query, the same applies again, you don't need the subqueries, the ID of the user is already in the Comments table:

SELECT *
FROM dbo.Comments
WHERE UserId < 5;

As for your EXISTS the reason it isn't working is because it isn't correlated. As such the subquery in your EXISTS always returns at least one row, and so the EXISTS evaluates to TRUE; and so every row in the table Comments is returned.

A correlated query would look something like this (this is not relevant to the queries you have, as they are overly complicated):

SELECT {Columns}
FROM dbo.SomeTable ST
WHERE EXISTS (SELECT 1
              FROM dbo.OtherTable OT
              WHERE OT.OtherInt > 10
                AND OT.SomeID = ST.ID);

Note the clause OT.SomeID = ST.ID which means that the tables dbo.SomeTable and dbo.OtherTable have some relationship. As such the above query would only return rows from dbo.SomeTable where a related row existed in dbo.OtherTable and that related row has a value greater than 10 in the column OtherInt.

Upvotes: 2

Related Questions