Reputation: 871
What's the best way to get data that matches any one of ~100k values?
For this question, I'm using an Amazon Redshift database and have a table something like this with hundreds of millions of rows:
---------------------
| userID | c1 | c2  |
| 101000 | 12 | 'a' |
| 101002 | 25 | 'b' |
---------------------
There are also millions of unique userIDs. I have a CSV list of 98,000 userIDs that I care about, and I want to do math on the columns for those specific users.
select c1, c2 from table where userID in (10101, 10102, ...)
What's the best solution to match against a giant list like this?
My approach was to write a Python script that read in all of the rows matching our condition set and then filtered against the CSV in Python. It was dead slow, though, and wouldn't work in all scenarios.
A coworker suggested uploading the 98k userIDs into a temporary table and then joining against it in the query. This seems like the smartest way, but I wanted to ask if you all had ideas.
I also wondered about printing an insanely long SQL query containing all 98k userIDs to match against and running it. Out of curiosity, would that even have run?
Upvotes: 0
Views: 2043
Reputation: 12756
As your coworker suggests, put your IDs into a temporary table by uploading the CSV to S3 and then using COPY to import the file into a table. You can then use an INNER JOIN to filter your main data table down to the list of IDs you're interested in.
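A minimal sketch of that flow, assuming the CSV sits in an S3 bucket your cluster can read via an attached IAM role (the bucket path, role ARN, and table names here are placeholders):

    -- Temp table to hold the 98k IDs; dropped automatically at session end
    CREATE TEMP TABLE target_users (userID INTEGER);

    -- Bulk-load the IDs straight from S3
    COPY target_users
    FROM 's3://my-bucket/target_users.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV;

    -- Filter the big table down to just those users
    SELECT t.c1, t.c2
    FROM big_table t
    INNER JOIN target_users u ON u.userID = t.userID;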
An alternative, if uploading a file to S3 isn't possible for you, is to use CREATE TEMP TABLE to set up a table for your list of IDs and then use a spreadsheet to generate a whole set of INSERT statements to populate it. 100k individual inserts could be quite slow, though.
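If you go that route, batching the IDs into multi-row VALUES lists (which Redshift supports) is much faster than one statement per ID; the table and ID values below are again just placeholders:

    CREATE TEMP TABLE target_users (userID INTEGER);

    -- One INSERT per batch of IDs instead of one per ID
    INSERT INTO target_users (userID) VALUES
      (10101),
      (10102),
      (10103);  -- ...continue in batches of a few thousand rows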
Upvotes: 5