Using group by and self join to find duplicates

Question

I have a table that i am attempting to query which looks like this:

t_documents
id
user_id
submitted_date
text
status

A user can have multiple documents in the documents table and is free to submit the same text multiple times. I wanted a way to see how many duplicate submissions are made per user. So for ex:

VALUES (1, 1234, 2016-07-05, "this is a test", 3)
VALUES (2, 1234, 2016-07-06, "this is a test", 3)
VALUES (3, 5678, 2016-07-07, "this is another test", 3)
VALUES (4, 5678, 2016-07-08, "this is another test", 3)

For the above set of data, I wanted a result that says give me a record for user 1234, the duplicate text and the number of times the duplicate text has been submitted. I have tried the following:

select oring.user_id, orig.text, COUNT(1) as dups
from t_documents orig
join t_documents another
on orig.user_id = another.user_id
and orig.text = another.text
group by user_id

The above is super rough and does not work. Can anyone please advise how to do what I want? Another query I would be interested in would be, in total, how many duplicate entries are there across all the users?

Tim Biegeleisen · Accepted Answer

I'm not sure that you need a self-join here. A simple GROUP BY on the user ID and text columns should suffice:

SELECT user_id, COUNT(*) AS dup_count
FROM t_documents
GROUP BY user_id, text

I assume here that you don't care about the posting date when determining whether a text be duplicate or not.

Edit:

If you want to find the total number of duplicates among all users then you can try the following query:

SELECT SUM(t.dup_count)
FROM
(
    SELECT user_id, COUNT(*) - COUNT(DISTINCT text) AS dup_count
    FROM t_documents
    GROUP BY user_id
) t

Using group by and self join to find duplicates

Answers (1)

Related Questions