How to find pairs of words in the table given below

Question

Find all pairs of frequent words that occur in the same document id and report the number of documents the pair occurs in. Report the pairs in decreasing order of frequency.

Note: there should not be any duplicate entries, such as (truck, boat) (truck, boat).
Note: the same pair should not occur twice in opposite order. Only one of the following should appear: (truck, boat) (boat, truck).

+-------+-----+-----+---------+
|vocabId|docId|count|     word|
+-------+-----+-----+---------+
|      1|    1| 1000|    plane|
|      1|    3|  100|    plane|
|      3|    1| 1200|motorbike|
|      3|    2|  702|motorbike|
|      3|    3|  600|motorbike|
|      5|    3| 2000|     boat|
|      5|    2|  200|     boat|
+-------+-----+-----+---------+

I have used this query, but it gives the wrong result:

select r1.word,r2.word, count(*) 
from result_T r1 
JOIN result_T r2 ON r1.docId = r2.docId 
and r1.word = r2.word group by r1.word, r2.word

Expected Output:

boat, motorbike, 2
motorbike, plane, 2
boat, plane, 1

Tim Biegeleisen · Accepted Answer

You were on the right track with a self-join, but the join condition needs to change. Joining on r1.word = r2.word only matches each word with itself, so no real pairs are produced. Instead, require that the first word be lexicographically less than the second (r1.word < r2.word). This excludes self-pairs and keeps only one ordering of each pair, so no pair is double-counted. The document IDs also have to match, which you were already checking.

SELECT
    r1.word,
    r2.word,
    COUNT(*) AS cnt
FROM result_T r1
INNER JOIN result_T r2
    ON r1.word < r2.word AND
       r1.docId = r2.docId
GROUP BY
    r1.word,
    r2.word
ORDER BY
    COUNT(*) DESC;
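
To check the query against the sample data from the question, here is a minimal sketch that loads the seven rows into an in-memory SQLite database and runs the same self-join. SQLite is only a stand-in here (the table in the question may come from a different engine), the function name find_pairs is hypothetical, and the extra tie-break on the word columns is an addition to make the order of equal counts deterministic.

```python
import sqlite3

# Sample rows from the question: (vocabId, docId, count, word)
ROWS = [
    (1, 1, 1000, "plane"),
    (1, 3, 100, "plane"),
    (3, 1, 1200, "motorbike"),
    (3, 2, 702, "motorbike"),
    (3, 3, 600, "motorbike"),
    (5, 3, 2000, "boat"),
    (5, 2, 200, "boat"),
]

# Same join logic as the answer. The tie-break on word1, word2 is an
# assumption added here so that pairs with equal counts sort predictably.
PAIR_QUERY = """
SELECT
    r1.word AS word1,
    r2.word AS word2,
    COUNT(*) AS cnt
FROM result_T r1
INNER JOIN result_T r2
    ON r1.word < r2.word AND
       r1.docId = r2.docId
GROUP BY
    r1.word,
    r2.word
ORDER BY
    cnt DESC,
    word1,
    word2
"""


def find_pairs(rows):
    """Load rows into an in-memory table and return (word1, word2, cnt) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE result_T '
            '(vocabId INTEGER, docId INTEGER, "count" INTEGER, word TEXT)'
        )
        conn.executemany("INSERT INTO result_T VALUES (?, ?, ?, ?)", rows)
        return conn.execute(PAIR_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for word1, word2, cnt in find_pairs(ROWS):
        print(f"{word1}, {word2}, {cnt}")
```

On the sample data this reproduces the three rows listed under Expected Output. Note that COUNT(*) equals the number of documents only if each word appears at most once per docId, as it does in the sample table.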
