Reputation:
I have a large Redshift table (hundreds of millions of rows, ~50 columns per row).
I need to find rows that have duplicate values in a specific column.
Example:
If my table has the columns 'column_of_interest' and 'date_time', then among those hundreds of millions of rows I need to find all the instances where 'column_of_interest' appears more than once within a certain 'date_time' range.
eg:
column_of_interest date_time
ROW 1: ABCD-1234 165895896565
ROW 2: FCEG-3434 165895896577
ROW 3: ABCD-1234 165895986688
ROW 4: ZZZZ-9999 165895986689
ROW 5: ZZZZ-9999 165895987790
In the above, since ROW 1 and ROW 3 have the same column_of_interest, I would like that column_of_interest returned. ROW 4 and ROW 5 match as well, so I would like that one returned too.
So the end result would be:
duplicates
ABCD-1234
ZZZZ-9999
I have found a few things online, but the table is so large that the query times out before any results are returned. Am I going about this the wrong way? Here are a couple that I tried just to get the results back (but they time out before returning).
SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING COUNT(*) > 1
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
LIMIT 200
SELECT a.*
FROM my_table a
JOIN (SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING count(*) > 1 ) b
ON a.column_of_interest = b.column_of_interest
ORDER BY a.column_of_interest
LIMIT 200
Upvotes: 1
Views: 2108
Reputation: 1270391
This is a fine method, and it should not "time out". Your first query has a syntax error: the WHERE clause must come before GROUP BY and HAVING.
So try:
SELECT column_of_interest, COUNT(*)
FROM my_table
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
GROUP BY column_of_interest
HAVING COUNT(*) > 1
LIMIT 200
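To illustrate the clause ordering on the sample rows from the question, here is a minimal sketch using an in-memory SQLite database in place of Redshift (the table name and timestamps come from the question; the date_time bounds are chosen just to cover the sample rows, and SQLite happens to accept the same clause order here):

```python
import sqlite3

# In-memory SQLite stand-in for the Redshift table, loaded with the
# sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (column_of_interest TEXT, date_time INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        ("ABCD-1234", 165895896565),
        ("FCEG-3434", 165895896577),
        ("ABCD-1234", 165895986688),
        ("ZZZZ-9999", 165895986689),
        ("ZZZZ-9999", 165895987790),
    ],
)

# WHERE filters rows first; GROUP BY/HAVING then keep only the values
# that occur more than once in the filtered range.
rows = conn.execute(
    """
    SELECT column_of_interest, COUNT(*)
    FROM my_table
    WHERE date_time >= 165895896565 AND date_time < 165895987791
    GROUP BY column_of_interest
    HAVING COUNT(*) > 1
    ORDER BY column_of_interest
    LIMIT 200
    """
).fetchall()
print(rows)  # [('ABCD-1234', 2), ('ZZZZ-9999', 2)]
```

FCEG-3434 appears only once in the range, so it is filtered out by HAVING, leaving exactly the two duplicated values the question asks for.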
Upvotes: 1