Reputation: 3331
I have a table that i am attempting to query which looks like this:
t_documents
id
user_id
submitted_date
text
status
A user can have multiple documents in the documents table and is free to submit the same text multiple times. I wanted a way to see how many duplicate submissions are made per user. So for ex:
VALUES (1, 1234, 2016-07-05, "this is a test", 3)
VALUES (2, 1234, 2016-07-06, "this is a test", 3)
VALUES (3, 5678, 2016-07-07, "this is another test", 3)
VALUES (4, 5678, 2016-07-08, "this is another test", 3)
For the above set of data, I wanted a result that says give me a record for user 1234, the duplicate text and the number of times the duplicate text has been submitted. I have tried the following:
select oring.user_id, orig.text, COUNT(1) as dups
from t_documents orig
join t_documents another
on orig.user_id = another.user_id
and orig.text = another.text
group by user_id
The above is super rough and does not work. Can anyone please advise how to do what I want? Another query I would be interested in would be, in total, how many duplicate entries are there across all the users?
Upvotes: 1
Views: 1274
Reputation: 521289
I'm not sure that you need a self-join here. A simple GROUP BY
on the user ID and text columns should suffice:
SELECT user_id, COUNT(*) AS dup_count
FROM t_documents
GROUP BY user_id, text
I assume here that you don't care about the posting date when determining whether a text be duplicate or not.
Edit:
If you want to find the total number of duplicates among all users then you can try the following query:
SELECT SUM(t.dup_count)
FROM
(
SELECT user_id, COUNT(*) - COUNT(DISTINCT text) AS dup_count
FROM t_documents
GROUP BY user_id
) t
Upvotes: 3