Bilberryfm
Bilberryfm

Reputation: 537

Strange results when running SQL IN and NOT IN using Redshift

 , v2p AS (
      SELECT DISTINCT video_asin, product_asin 
      from video_table
      where video_asin IN (
            ‘ABC’,
            ‘CDF’,
            ‘DEF’,
            ‘FRW’)
  )
, video_data AS (
  SELECT distinct
      start_date::date,
      video_view,
      video_asin,
      page_asin
  FROM video_metrics
  WHERE 
   page_asin IN (select DISTINCT product_asin from v2p)
    AND start_date between TO_DATE('01/26/2018','MM-DD-YYYY') and TO_DATE('02/04/2018','MM-DD-YYYY') + 0.9999
) 
select sum(video_view) from video_data;

I’m running simple query, but I’m getting strange results with IN and NOT IN .

(1). When I do page_asin IN (select DISTINCT product_asin from v2p), I am getting 1,109,567 as select sum(video_view) from video_data;

(2). When I do the same query but with NOT IN page_asin NOT IN (select DISTINCT product_asin from v2p), I am getting 7,032,405 as sum.

(3). If I remove this whole line (page_asin IN (select DISTINCT product_asin from v2p),) to get all results, I am getting 8,148,803. as sum.

But I supposed that if I add (1) and (2) I should get (3). But in reality, I am getting: 1,109,567 + 7,032,405 = 8,141,972 AND NOT 8,148,803. from (3).

Why that happens? Why I am missing ~7000 video views?

Upvotes: 2

Views: 4889

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270713

A NULL value would be neither IN nor NOT IN a list.

So, page_asin must be NULL on some rows.

As a note: The select distinct is redundant in the subquery. There is no reason to include it.

Upvotes: 4

Related Questions