jatin mittal
jatin mittal

Reputation: 11

How to find pairs in the below given table

Find all pairs of frequent words that occur in the same document id and report the number of documents the pair occurs in. Report the pairs in decreasing order of frequency.

+-------+-----+-----+---------+
|vocabId|docId|count|     word|
+-------+-----+-----+---------+
|      1|    1| 1000|    plane|
|      1|    3|  100|    plane|
|      3|    1| 1200|motorbike|
|      3|    2|  702|motorbike|
|      3|    3|  600|motorbike|
|      5|    3| 2000|     boat|
|      5|    2|  200|     boat|
+-------+-----+-----+---------+

I have used this query but it is giving me the wrong result

select r1.word,r2.word, count(*) 
from result_T r1 
JOIN result_T r2 ON r1.docId = r2.docId 
and r1.word = r2.word group by r1.word, r2.word

Expected Ouput:

boat, motorbike, 2
motorbike, plane, 2
boat, plane, 1

Upvotes: 0

Views: 54

Answers (2)

Michał Turczyn
Michał Turczyn

Reputation: 37377

Try below query:

declare @tbl table (docId int, word varchar(20));
insert into @tbl values 
( 1,'plane'),
( 3,'plane'),
( 1,'motorbike'),
( 2,'motorbike'),
( 3,'motorbike'),
( 3,'boat'),
( 2,'boat');

select words, count(*) from (
    select distinct t1.docId,
           case when t1.word < t2.word then t1.word else t2.word end + ',' +
           case when t1.word >= t2.word then t1.word else t2.word end words
    from @tbl t1
    join @tbl t2 on t1.docId = t2.docId and t1.word <> t2.word
) a group by words
order by count(*) desc

Upvotes: 0

Tim Biegeleisen
Tim Biegeleisen

Reputation: 521457

You were on the right track with a self-join, but the join logic needs to change a bit. The join condition should be that the first word is lexicographically less than the second word. This ensures that pairs will not be double-counted. Also, the document IDs have to match (you were already checking for this).

SELECT
    r1.word,
    r2.word,
    COUNT(*) AS cnt
FROM result_T r1
INNER JOIN result_T r2
    ON r1.word < r2.word AND
       r1.docId = r2.docId
GROUP BY
    r1.word,
    r2.word
ORDER BY
    COUNT(*) DESC;

enter image description here

Demo

Upvotes: 1

Related Questions