Balaji Govindan
Balaji Govindan

Reputation: 11

AWS redshift range join slow

I want to identify the location (country, city, area, latitude, longitude) details of clients who visit my site based on the client IP address for my analysis. We have IP address data with details of range of IP (fromIP,toIP) for each of the location. I will be running a job every day overnight to identify and update the location details for that day in my table which stores the client visit details.

Below is my table structure

CREATE TABLE CLIENT_VISIT
(VISIT_ID VARCHAR(65535), 
CLIENT_ID INTEGER,
CLIENTIP_VALUE BIGINT, 
VISIT_DATE TIMESTAMP WITHOUT TIME ZONE,
COUNTRY VARCHAR(180), 
CITY VARCHAR(180),
AREA VARCHAR(180), 
LATITUDE VARCHAR(60), 
LONGITUDE VARCHAR(60))
SORTKEY (   VISIT_DATE  );
    
CREATE TABLE IP_ADDRESS_DETAILS
(
FROM_IP_VALUE BIGINT,
TO_IP_VALUE BIGINT,
COUNTRY VARCHAR(180), 
CITY VARCHAR(180),
AREA VARCHAR(180), 
LATITUDE VARCHAR(60), 
LONGITUDE VARCHAR(60)
)SORTKEY (
    FROM_IP_VALUE,TO_IP_VALUE
    );

Below is my query

    SELECT cv.visit_id,cv.client_id ,ida.country,ida.city,ida.area,
        ida.latitude,ida.longitude
    FROM   client_visit cv,ip_address_details ida
    WHERE  cv.visit_date  >= trunc(sysdate-2)
    AND cv.visit_date  < trunc(sysdate-1)
    AND (cv.clientip_value >=  ida.from_ip_value) 
    AND (cv.clientip_value < ida.to_ip_value)

The table client_visit contains about 50 million rows. The table ip_address_details contains about 4 million rows.

Stand alone when i run the query on client_visit with the visit_date filter alone i get the result quickly. Result is about 100 thousand rows. But on joining with ip_address_details the query hangs and does not return even after 1hr.

I have tried with DISTSTYLE KEY and DISTKEY (to_ip_value) on ip_address_details table.

I have also tried creating Primary and foreign key on tables ip_address_details and client_visit on the IP address column to influence the query execution path. I have tried variety of other options of re-writing the query modifying the DISTSTYLE and DISTKEY.

Still when i try to identify the location details my query hangs. The plan always shows nested loop join and seq scan. I am relatively new to redshift. Any ideas/suggestions on how this can be improved. Any pointers on this will be of great help.

Below is the plan

XN Nested Loop DS_BCAST_INNER  (cost=70225.77..714363120.75 rows=53406584 width=606)
  Join Filter: (("inner".clientip_value < "outer".to_ip_value) AND ("inner".clientip_value >= "outer".from_ip_value))
  ->  XN Seq Scan on ip_address_details ida  (cost=0.00..36413.58 rows=3641358 width=98)
  ->  XN Materialize  (cost=70225.77..70227.09 rows=132 width=524)
        ->  XN Seq Scan on client_visit cv  (cost=0.00..1.64 rows=132 width=524)
                      Filter: ((process_time >= '2025-01-31 00:00:00'::timestamp without time zone) AND (process_time < '2025-02-01 00:00:00'::timestamp without time zone))
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

I have tried with creating primary key on IP_ADDRESS_DETAILS.TO_IP_VALUE and foreign key on CLIENT_VISIT.CLIENTIP_VALUE referencing the TO_IP_VALUE also. From the data i know below condition will get max of 1 row

AND (cv.clientip_value >=  ida.from_ip_value) 
AND (cv.clientip_value < ida.to_ip_value)

Upvotes: 0

Views: 64

Answers (1)

Bill Weiner
Bill Weiner

Reputation: 11082

It looks very likely that you are creating trillion (or more) of rows in this JOIN operation. Creating the manipulating this amount of data will be very slow as it likely will cause spill. Run EXPLAIN on your query and look at the number of rows (n) that it estimates this JOIN will create. I expect you need to qualify this JOIN further.

Upvotes: 0

Related Questions