Reputation: 40746
Currently I'm having the following table structure.
Master table Documents:
ID | Filename |
---|---|
1 | document1.pdf |
2 | document2.pdf |
3 | document3.pdf |
Detail table Keywords:
ID | DocumentID | Keyword |
---|---|---|
1 | 1 | KeywordA |
2 | 1 | KeywordB |
3 | 1 | KeywordC |
4 | 2 | KeywordB |
5 | 3 | KeywordA |
6 | 3 | KeywordD |
Code to create this:
CREATE TABLE Documents (
ID int IDENTITY(1,1) PRIMARY KEY,
Filename nvarchar(255) NOT NULL
);
CREATE TABLE Keywords (
ID int IDENTITY(1,1) PRIMARY KEY,
DocumentID int NOT NULL,
Keyword nvarchar(255) NOT NULL
);
INSERT INTO Documents(Filename) VALUES
('document1.pdf'), ('document2.pdf'), ('document3.pdf');
INSERT INTO Keywords(DocumentID, Keyword) VALUES
(1, 'KeywordA'),
(1, 'KeywordB'),
(1, 'KeywordC'),
(2, 'KeywordB'),
(3, 'KeywordA'),
(3, 'KeywordD');
I'm looking for a way to get all documents matching a certain keyword.
This could be e.g. written with the following T-SQL query:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE Keywords.Keyword = 'KeywordA'
)
This works successfully.
What I'm currently stuck with is when I want to find all documents that match multiple keyword, combined with logical AND.
E.g. find a document that has three detail records with keyword A, B and C.
I think the following might work, but I don't know whether this performant or elegant at all:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE
Keywords.Keyword = 'KeywordA' OR
Keywords.Keyword = 'KeywordB'
GROUP BY Keywords.DocumentID HAVING COUNT(*) = 2
)
How to write a (performant) SQL query to find all documents that have multiple keywords associated.
If it is easier, a solution with a constant number of keywords (e.g. 3) would be sufficient.
Upvotes: 1
Views: 506
Reputation: 13237
I hope the following query can help you
SELECT D.ID
FROM Documents D
JOIN Keywords K ON K.DocumentID = D.ID
WHERE K.Keyword IN ('KeywordA', 'KeywordB', 'KeywordC')
GROUP BY D.ID
HAVING COUNT(DISTINCT K.Keyword) = 3
Upvotes: 1
Reputation: 72153
The technique you are trying to do is called Relational Division With Remainder, in other words: find all groups which contain a particular set of rows.
Your current query is one of the standard ways of doing this, there are others.
If you had the keywords in a table variable or TVP, ...
DECLARE @keywords AS TABLE (Keyword varchar(50));
INSERT @keywords VALUES
('KeywordA'), ('KeywordB'), ('KeywordC');
... you could make it much neater with the following:
SELECT d.*
FROM Documents d
WHERE d.ID IN
(
SELECT k.DocumentID
FROM Keywords k
JOIN @keywords kt ON kt.Keyword = k.Keyword
GROUP BY k.DocumentID
HAVING COUNT(*) = (SELECT COUNT(*) FROM @keywords)
);
Another option:
SELECT d.*
FROM Documents d
WHERE EXISTS (SELECT 1
FROM @keywords kt
LEFT JOIN Keywords k ON kt.Keyword = k.Keyword
AND k.DocumentID = d.ID
HAVING COUNT(*) = COUNT(k.keywords) -- there are no missing matches
);
And another, slightly confusing one:
SELECT d.*
FROM Documents d
WHERE NOT EXISTS (SELECT 1
FROM @keywords kt
WHERE NOT EXISTS (SELECT 1
FROM Keywords k
WHERE k.Keyword = kt.Keyword
AND K.DocumentID = d.ID
)
);
-- For each document, there are no keywords for which there is no match
Upvotes: 1