Ian
Ian

Reputation: 1622

Need help understanding why multiple left joins aren't returning in snowflake

Having some issues with multiple left joins not doing what I expect them to!

select 
    sent.id,
    sent.ts,
    sent.email,
    delivered.ts,
    type.label,
    min(opens.ts) as first_open,
    count(opens.id) as open_count,
    min(clicks.ts) as first_click,
    count(clicks.id) as click_count
from sent
inner join type on type.id = sent.type_id
left outer join delivered on (delivered.id = sent.id)
left outer join opens on (opens.id = sent.id)
left outer join clicks on (clicks.id = sent.id)
where sent.id = 'a1b1c1d1e1'
group by 
    sent.id,
    sent.ts,
    sent.email,
    delivered.ts,
    type.label,
    opens.id,
    clicks.id
;

A message is sent, then delivered; that's 1 to 1, but, delivered may not exist.

A message can then be opened (multiple times) and clicked (multiple times), all tied together with the sent.id.

If I just have the opens join, it works fine however, likewise if I just have the clicks join.

When I add the clicks join the first_click and click_count show the same values as open.

I get:

1,2020-01-01 00:00:00,[email protected],2020-01-01 00:00:00,test,2020-01-01 01:00:00,4,2020-01-01 01:00:00,4

When it should be:

1,2020-01-01 00:00:00,[email protected],2020-01-01 00:00:00,test,2020-01-01 01:00:00,4,2020-01-01 02:00:00,1

I've tried running with no query cache (ALTER SESSION SET USE_CACHED_RESULT = false;) and done a basic mirror in MySQL to prove the join and it's fine.

Upvotes: -1

Views: 5251

Answers (1)

Simeon Pilgrim
Simeon Pilgrim

Reputation: 25978

so trying to bridge the gap between you problem description and the results you mention

start with known data

create or replace table sent (id text, ts timestamp_ntz, email text, type_id number);
create or replace table type (id number, label text);
create or replace table delivered(id text, ts timestamp_ntz);
create or replace table opens(id text, ts timestamp_ntz);
create or replace table clicks(id text, ts timestamp_ntz);

insert into sent values ('a1b1c1d1e1', '2020-01-01 01:00', '[email protected]', 1);
insert into delivered values ('a1b1c1d1e1', '2020-01-01 02:00');
insert into type values (1, 'test');
insert into opens values ('a1b1c1d1e1', '2020-01-01 03:00'),('a1b1c1d1e1', '2020-01-01 04:00'),('a1b1c1d1e1', '2020-01-01 05:00'),('a1b1c1d1e1', '2020-01-01 06:00');
insert into clicks values ('a1b1c1d1e1', '2020-01-01 07:00');

select 
    sent.id
    ,sent.ts
    ,sent.email
    ,delivered.ts
    ,type.label
    ,min(opens.ts) as first_open
    ,count(opens.id) as open_count
    ,min(clicks.ts) as first_click
    ,count(clicks.id) as click_count
from sent
join type on type.id = sent.type_id
left join delivered on (delivered.id = sent.id)
left join opens on (opens.id = sent.id)
left join clicks on (clicks.id = sent.id)
where sent.id = 'a1b1c1d1e1'
group by 1,2,3,4, 5;

I swapped the columns names to their position, because I like it that way, but you don't need to opens.id or clicks.id as those are a not selected in a non-aggregates colunms.

 ID TS  EMAIL   TS  LABEL   FIRST_OPEN  OPEN_COUNT  FIRST_CLICK CLICK_COUNT
 a1b1c1d1e1 2020-01-01 01:00:00.000 [email protected] 2020-01-01 02:00:00.000 test    2020-01-01 03:00:00.000 4   2020-01-01 07:00:00.000 4

I am not sure what join behavior you are changing.. but it might be helpful to print all the rows and see what happening to understand why you are not getting what you expect.

select 
    sent.id
    ,sent.ts
    ,sent.email
    ,delivered.ts
    ,type.label
    ,opens.ts as open_ts
    ,clicks.ts as click_ts
    --,min(opens.ts) as first_open
    --,count(opens.id) as open_count
    --,min(clicks.ts) as first_click
    --,count(clicks.id) as click_count
from sent
join type on type.id = sent.type_id
left join delivered on (delivered.id = sent.id)
left join opens on (opens.id = sent.id)
left join clicks on (clicks.id = sent.id)
where sent.id = 'a1b1c1d1e1'
--group by 1,2,3,4, 5;

gives for me:

 ID TS  EMAIL   TS  LABEL   OPEN_TS CLICK_TS
 a1b1c1d1e1 2020-01-01 01:00:00.000 [email protected] 2020-01-01 02:00:00.000 test    2020-01-01 03:00:00.000 2020-01-01 07:00:00.000
 a1b1c1d1e1 2020-01-01 01:00:00.000 [email protected] 2020-01-01 02:00:00.000 test    2020-01-01 04:00:00.000 2020-01-01 07:00:00.000
 a1b1c1d1e1 2020-01-01 01:00:00.000 [email protected] 2020-01-01 02:00:00.000 test    2020-01-01 05:00:00.000 2020-01-01 07:00:00.000
 a1b1c1d1e1 2020-01-01 01:00:00.000 [email protected] 2020-01-01 02:00:00.000 test    2020-01-01 06:00:00.000 2020-01-01 07:00:00.000

which is what I would expect from ether LEFT or normal INNER joins.. feel free to update with the SQL the give you the broken results, and a listed version of of the output like above, to get a better explanation.

Upvotes: 2

Related Questions