user8072194


Most Efficient Way to Search a Massive Redshift Table for Duplicate Values

I have a large Redshift table (hundreds of millions of rows, with ~50 columns per row).

I need to find rows where a specific column's value is duplicated across rows.

Example:

If my table has the columns 'column_of_interest' and 'date_time', then in those hundreds of millions of rows I need to find all the instances where the same 'column_of_interest' value appears more than once within a certain 'date_time' range.

eg:

         column_of_interest  date_time
  ROW 1: ABCD-1234           165895896565
  ROW 2: FCEG-3434           165895896577
  ROW 3: ABCD-1234           165895986688
  ROW 4: ZZZZ-9999           165895986689
  ROW 5: ZZZZ-9999           165895987790

In the above, since ROW 1 and ROW 3 have the same column_of_interest, I would like that column_of_interest returned; the same goes for ROW 4 and ROW 5.

So the end result would be:

duplicates
ABCD-1234
ZZZZ-9999

I have found a few things online, but the table is so large that the query times out before any results are returned. Am I going about this the wrong way? Here are a couple of queries I tried just to get the results back (but they time out before returning).

SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING COUNT(*) > 1
WHERE date_time >= 1601510400000 AND date_time < 1601596800000 
LIMIT 200

SELECT a.*
FROM my_table a
JOIN (SELECT column_of_interest, COUNT(*)
      FROM my_table
      GROUP BY column_of_interest
      HAVING COUNT(*) > 1) b
  ON a.column_of_interest = b.column_of_interest
ORDER BY a.column_of_interest
LIMIT 200

Upvotes: 1

Views: 2108

Answers (1)

Gordon Linoff

Reputation: 1270391

This should be a fine method, and it should not "time out". Your version has a syntax error: the WHERE clause must come before GROUP BY and HAVING.

So try:

SELECT column_of_interest, COUNT(*)
FROM my_table
WHERE date_time >= 1601510400000 AND date_time < 1601596800000 
GROUP BY column_of_interest
HAVING COUNT(*) > 1
LIMIT 200
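
If you also need the full duplicate rows rather than just the values, a window function returns them in a single scan, avoiding the self-join from your second attempt. A minimal sketch, assuming the same table and column names as in the question:

-- Every row whose column_of_interest appears more than once within
-- the date_time range; the WHERE filter is applied before the
-- partitioned count is computed.
SELECT *
FROM (SELECT t.*,
             COUNT(*) OVER (PARTITION BY column_of_interest) AS cnt
      FROM my_table t
      WHERE date_time >= 1601510400000 AND date_time < 1601596800000) dup
WHERE cnt > 1

If either query is still slow, check whether date_time is a sort key on the table; if it is, the range predicate lets Redshift skip blocks instead of scanning the whole table.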

Upvotes: 1
