Pandafreak
Pandafreak

Reputation: 107

Three tables and two condition filter in Snowflake SQL

account_table as acc:

old_accountID new_accountID
123 345
678 910

userid_table as activ:

account_name account_id
Musk1 123
Musk2 345
Zuckerberg1 678
Zuckerberg2 910

activity_table as activ:

account_name active_l30 last_activity_year
Musk1 FALSE 2009
Musk2 TRUE 2022
Zuckerberg1 FALSE 2021
Zuckerberg2 TRUE 2022

Final output I want:

old_accountID new_accountID
678 910

I'd like to apply two conditions to my account_table 1) old accountID needs to have had activity in the last two years e.g. Musk1 wouldn't pull in because last_activity year is more than 2 years ago in 2009 and 2) new_accountID is active_l30 (according to my activity_table).

I have written a query that joins account_table and userid_table and activity_table to filter for all old_accountID that were active in the last 2 years. But how do I apply a "second filter" so that I'm basically also able to filter for only those new_accountID that are active_l30=TRUE?

Here is what I've got so far:

Select u.account_id, u.account_name
from userid_table as u
JOIN activity_table as activ on u.account_name=activ.account_name
where u.account_id in (select old_accountID from account_table)
and last_activity_year>2020

Upvotes: 1

Views: 587

Answers (1)

Simeon Pilgrim
Simeon Pilgrim

Reputation: 25903

So turning you data into some CTE so I don't have to create tables:

with account_table(old_accountID, new_accountID) as (
    select * from values
        (123, 345),
        (678, 910)
), userid_table(account_name, account_id) as (
    select * from values
        ('Musk1', 123),
        ('Musk2', 345),
        ('Zuckerberg1', 678),
        ('Zuckerberg2', 910)
), activity_table(account_name, active_l30, last_activity_year) as (
    select * from values
        ('Musk1', FALSE,    2009),
        ('Musk2', TRUE, 2022),
        ('Zuckerberg1', FALSE,  2021),
        ('Zuckerberg2', TRUE,   2022)
)

And changing your SQL a little bit, from a WHERE IN to just a JOIN (assuming there are no duplicates on the old_accountID)

Select 
    u.account_id, 
    u.account_name
    ,ac.last_activity_year>2020 as act_last_2year
    ,ac.active_l30
from userid_table as u
join account_table as a
    on a.old_accountID = u.account_id
JOIN activity_table as ac 
    on u.account_name = ac.account_name

and printing the two rules you want to check:

ACCOUNT_ID ACCOUNT_NAME ACT_LAST_2YEAR ACTIVE_L30
123 Musk1 FALSE FALSE
678 Zuckerberg1 TRUE FALSE

so you could have these are WHERE clauses and with your current data you would have no results.

Another way:

So if we take your question one at a time:

Find accounts that where active in the last two years:

select *
    ,ac.last_activity_year>2020 as act_last_2year
from userid_table as u
join activity_table as ac
    on u.account_name = ac.account_name
ACCOUNT_NAME ACCOUNT_ID ACCOUNT_NAME_2 ACTIVE_L30 LAST_ACTIVITY_YEAR ACT_LAST_2YEAR
Musk1 123 Musk1 FALSE 2,009 FALSE
Musk2 345 Musk2 TRUE 2,022 TRUE
Zuckerberg1 678 Zuckerberg1 FALSE 2,021 TRUE
Zuckerberg2 910 Zuckerberg2 TRUE 2,022 TRUE

Then there is an unspoken rule, that the data implies to me, people can have many account and lets treat them as one. The logic being for those "active in the last two year" if that prior activity was the old account, the person, was still active, just an account jumper.. So how I am going to do this is the UNION of userid with old, and just userid.

select u.*
    ,a.old_accountID as link_id
from userid_table as u
join account_table as a
    on u.account_id = a.new_accountID
union all
select u.*
    ,u.account_id as link_id
from userid_table as u
ACCOUNT_NAME ACCOUNT_ID LINK_ID
Musk2 345 123
Zuckerberg2 910 678
Musk1 123 123
Musk2 345 345
Zuckerberg1 678 678
Zuckerberg2 910 910

Now the first result was userid x active in the last 2 year, can be joined on all the LINK_id's so every current account has the prior accounts goodness mapped to it (this method assumes, you have a one to one of account histories a->b->c is expressed as a->b,b->c,a->c otherwise this has to be unrolled from the data).

thus:

select h.*
    ,ac.*
    ,ac.last_activity_year>2020 as act_last_2year
from (
    select u.*
        ,a.old_accountID as link_id
    from userid_table as u
    join account_table as a
        on u.account_id = a.new_accountID
    union all
    select u.*
        ,u.account_id as link_id
    from userid_table as u
) as h
join userid_table as u 
    on u.account_id = h.link_id
join activity_table as ac
    on u.account_name = ac.account_name
order by 1,3

gives us:

ACCOUNT_NAME ACCOUNT_ID LINK_ID ACCOUNT_NAME_2 ACTIVE_L30 LAST_ACTIVITY_YEAR ACT_LAST_2YEAR
Musk1 123 123 Musk1 FALSE 2,009 FALSE
Musk2 345 123 Musk1 FALSE 2,009 FALSE
Musk2 345 345 Musk2 TRUE 2,022 TRUE
Zuckerberg1 678 678 Zuckerberg1 FALSE 2,021 TRUE
Zuckerberg2 910 678 Zuckerberg1 FALSE 2,021 TRUE
Zuckerberg2 910 910 Zuckerberg2 TRUE 2,022 TRUE

Now we have the status of each userid and if that account has historically been active, we can join to the activity_table again (for the current account) to find the l30 active use. By the way, this is all over verbose, just so it can be tracked the thought progression, but we will push things into tighter SQL once we accept why/what we are doing.

select h.*
    ,ac_h.last_activity_year
    ,ac_h.last_activity_year>2020 as act_last_2year
    ,ac.active_l30
from (
    select u.*
        ,a.old_accountID as link_id
    from userid_table as u
    join account_table as a
        on u.account_id = a.new_accountID
    union all
    select u.*
        ,u.account_id as link_id
    from userid_table as u
) as h
join userid_table as u 
    on u.account_id = h.link_id
join activity_table as ac_h
    on u.account_name = ac_h.account_name
join activity_table as ac
    on h.account_name =  ac.account_name
order by 1,3

gives:

ACCOUNT_NAME ACCOUNT_ID LINK_ID LAST_ACTIVITY_YEAR ACT_LAST_2YEAR ACTIVE_L30
Musk1 123 123 2,009 FALSE FALSE
Musk2 345 123 2,009 FALSE TRUE
Musk2 345 345 2,022 TRUE TRUE
Zuckerberg1 678 678 2,021 TRUE FALSE
Zuckerberg2 910 678 2,021 TRUE TRUE
Zuckerberg2 910 910 2,022 TRUE TRUE

So now we can add the WHERE clause:

where act_last_2year and ac.active_l30

which gives:

ACCOUNT_NAME ACCOUNT_ID LINK_ID LAST_ACTIVITY_YEAR ACT_LAST_2YEAR ACTIVE_L30
Musk2 345 345 2,022 TRUE TRUE
Zuckerberg2 910 678 2,021 TRUE TRUE
Zuckerberg2 910 910 2,022 TRUE TRUE

so now we can see that both Zuckerberg2 accounts (910 & 678) where active in the prior two years. So this can be fixed with a QUALIFY ROW_NUMBER to pick the newest.

qualify row_number() over (partition by h.account_name order by link_id desc) = 1
ACCOUNT_NAME ACCOUNT_ID LINK_ID LAST_ACTIVITY_YEAR ACT_LAST_2YEAR ACTIVE_L30
Musk2 345 345 2,022 TRUE TRUE
Zuckerberg2 910 910 2,022 TRUE TRUE

I would say "Excellent, now to clean this all up." but looking back at the answer I gave, I have just thrown away the one thing you wanted to keep.

What you actually asked for:

so lets flip all how I would answer you question on it's head, and just go fetch the values you seem to want:

select a.*
from account_table as a
join userid_table as ou
    on a.old_accountID = ou.account_id
join userid_table as nu
    on a.new_accountID = nu.account_id
join activity_table as oact
    on ou.account_name = oact.account_name
        and last_activity_year > 2020
join activity_table as nact
    on nu.account_name = nact.account_name
        and nact.active_l30;
OLD_ACCOUNTID NEW_ACCOUNTID
678 910

Okay, that could have been a lot shorter..

Upvotes: 1

Related Questions