Reputation: 11
Find all pairs of frequent words that occur in the same document id and report the number of documents the pair occurs in. Report the pairs in decreasing order of frequency.
+-------+-----+-----+---------+
|vocabId|docId|count| word|
+-------+-----+-----+---------+
| 1| 1| 1000| plane|
| 1| 3| 100| plane|
| 3| 1| 1200|motorbike|
| 3| 2| 702|motorbike|
| 3| 3| 600|motorbike|
| 5| 3| 2000| boat|
| 5| 2| 200| boat|
+-------+-----+-----+---------+
I have used this query but it is giving me the wrong result
select r1.word,r2.word, count(*)
from result_T r1
JOIN result_T r2 ON r1.docId = r2.docId
and r1.word = r2.word group by r1.word, r2.word
Expected Ouput:
boat, motorbike, 2
motorbike, plane, 2
boat, plane, 1
Upvotes: 0
Views: 54
Reputation: 37377
Try below query:
declare @tbl table (docId int, word varchar(20));
insert into @tbl values
( 1,'plane'),
( 3,'plane'),
( 1,'motorbike'),
( 2,'motorbike'),
( 3,'motorbike'),
( 3,'boat'),
( 2,'boat');
select words, count(*) from (
select distinct t1.docId,
case when t1.word < t2.word then t1.word else t2.word end + ',' +
case when t1.word >= t2.word then t1.word else t2.word end words
from @tbl t1
join @tbl t2 on t1.docId = t2.docId and t1.word <> t2.word
) a group by words
order by count(*) desc
Upvotes: 0
Reputation: 521457
You were on the right track with a self-join, but the join logic needs to change a bit. The join condition should be that the first word is lexicographically less than the second word. This ensures that pairs will not be double-counted. Also, the document IDs have to match (you were already checking for this).
SELECT
r1.word,
r2.word,
COUNT(*) AS cnt
FROM result_T r1
INNER JOIN result_T r2
ON r1.word < r2.word AND
r1.docId = r2.docId
GROUP BY
r1.word,
r2.word
ORDER BY
COUNT(*) DESC;
Upvotes: 1