Reputation: 107
account_table as acc:
old_accountID | new_accountID |
---|---|
123 | 345 |
678 | 910 |
userid_table as activ:
account_name | account_id |
---|---|
Musk1 | 123 |
Musk2 | 345 |
Zuckerberg1 | 678 |
Zuckerberg2 | 910 |
activity_table as activ:
account_name | active_l30 | last_activity_year |
---|---|---|
Musk1 | FALSE | 2009 |
Musk2 | TRUE | 2022 |
Zuckerberg1 | FALSE | 2021 |
Zuckerberg2 | TRUE | 2022 |
Final output I want:
old_accountID | new_accountID |
---|---|
678 | 910 |
I'd like to apply two conditions to my account_table 1) old accountID needs to have had activity in the last two years e.g. Musk1 wouldn't pull in because last_activity year is more than 2 years ago in 2009 and 2) new_accountID is active_l30 (according to my activity_table).
I have written a query that joins account_table and userid_table and activity_table to filter for all old_accountID that were active in the last 2 years. But how do I apply a "second filter" so that I'm basically also able to filter for only those new_accountID that are active_l30=TRUE?
Here is what I've got so far:
Select u.account_id, u.account_name
from userid_table as u
JOIN activity_table as activ on u.account_name=activ.account_name
where u.account_id in (select old_accountID from account_table)
and last_activity_year>2020
Upvotes: 1
Views: 587
Reputation: 25903
So turning you data into some CTE so I don't have to create tables:
with account_table(old_accountID, new_accountID) as (
select * from values
(123, 345),
(678, 910)
), userid_table(account_name, account_id) as (
select * from values
('Musk1', 123),
('Musk2', 345),
('Zuckerberg1', 678),
('Zuckerberg2', 910)
), activity_table(account_name, active_l30, last_activity_year) as (
select * from values
('Musk1', FALSE, 2009),
('Musk2', TRUE, 2022),
('Zuckerberg1', FALSE, 2021),
('Zuckerberg2', TRUE, 2022)
)
And changing your SQL a little bit, from a WHERE IN to just a JOIN (assuming there are no duplicates on the old_accountID
)
Select
u.account_id,
u.account_name
,ac.last_activity_year>2020 as act_last_2year
,ac.active_l30
from userid_table as u
join account_table as a
on a.old_accountID = u.account_id
JOIN activity_table as ac
on u.account_name = ac.account_name
and printing the two rules you want to check:
ACCOUNT_ID | ACCOUNT_NAME | ACT_LAST_2YEAR | ACTIVE_L30 |
---|---|---|---|
123 | Musk1 | FALSE | FALSE |
678 | Zuckerberg1 | TRUE | FALSE |
so you could have these are WHERE clauses and with your current data you would have no results.
So if we take your question one at a time:
Find accounts that where active in the last two years:
select *
,ac.last_activity_year>2020 as act_last_2year
from userid_table as u
join activity_table as ac
on u.account_name = ac.account_name
ACCOUNT_NAME | ACCOUNT_ID | ACCOUNT_NAME_2 | ACTIVE_L30 | LAST_ACTIVITY_YEAR | ACT_LAST_2YEAR |
---|---|---|---|---|---|
Musk1 | 123 | Musk1 | FALSE | 2,009 | FALSE |
Musk2 | 345 | Musk2 | TRUE | 2,022 | TRUE |
Zuckerberg1 | 678 | Zuckerberg1 | FALSE | 2,021 | TRUE |
Zuckerberg2 | 910 | Zuckerberg2 | TRUE | 2,022 | TRUE |
Then there is an unspoken rule, that the data implies to me, people can have many account and lets treat them as one. The logic being for those "active in the last two year" if that prior activity was the old account, the person, was still active, just an account jumper.. So how I am going to do this is the UNION of userid with old, and just userid.
select u.*
,a.old_accountID as link_id
from userid_table as u
join account_table as a
on u.account_id = a.new_accountID
union all
select u.*
,u.account_id as link_id
from userid_table as u
ACCOUNT_NAME | ACCOUNT_ID | LINK_ID |
---|---|---|
Musk2 | 345 | 123 |
Zuckerberg2 | 910 | 678 |
Musk1 | 123 | 123 |
Musk2 | 345 | 345 |
Zuckerberg1 | 678 | 678 |
Zuckerberg2 | 910 | 910 |
Now the first result was userid x active in the last 2 year, can be joined on all the LINK_id's so every current account has the prior accounts goodness mapped to it (this method assumes, you have a one to one of account histories a->b->c
is expressed as a->b,b->c,a->c
otherwise this has to be unrolled from the data).
thus:
select h.*
,ac.*
,ac.last_activity_year>2020 as act_last_2year
from (
select u.*
,a.old_accountID as link_id
from userid_table as u
join account_table as a
on u.account_id = a.new_accountID
union all
select u.*
,u.account_id as link_id
from userid_table as u
) as h
join userid_table as u
on u.account_id = h.link_id
join activity_table as ac
on u.account_name = ac.account_name
order by 1,3
gives us:
ACCOUNT_NAME | ACCOUNT_ID | LINK_ID | ACCOUNT_NAME_2 | ACTIVE_L30 | LAST_ACTIVITY_YEAR | ACT_LAST_2YEAR |
---|---|---|---|---|---|---|
Musk1 | 123 | 123 | Musk1 | FALSE | 2,009 | FALSE |
Musk2 | 345 | 123 | Musk1 | FALSE | 2,009 | FALSE |
Musk2 | 345 | 345 | Musk2 | TRUE | 2,022 | TRUE |
Zuckerberg1 | 678 | 678 | Zuckerberg1 | FALSE | 2,021 | TRUE |
Zuckerberg2 | 910 | 678 | Zuckerberg1 | FALSE | 2,021 | TRUE |
Zuckerberg2 | 910 | 910 | Zuckerberg2 | TRUE | 2,022 | TRUE |
Now we have the status of each userid and if that account has historically been active, we can join to the activity_table again (for the current account) to find the l30 active use. By the way, this is all over verbose, just so it can be tracked the thought progression, but we will push things into tighter SQL once we accept why/what we are doing.
select h.*
,ac_h.last_activity_year
,ac_h.last_activity_year>2020 as act_last_2year
,ac.active_l30
from (
select u.*
,a.old_accountID as link_id
from userid_table as u
join account_table as a
on u.account_id = a.new_accountID
union all
select u.*
,u.account_id as link_id
from userid_table as u
) as h
join userid_table as u
on u.account_id = h.link_id
join activity_table as ac_h
on u.account_name = ac_h.account_name
join activity_table as ac
on h.account_name = ac.account_name
order by 1,3
gives:
ACCOUNT_NAME | ACCOUNT_ID | LINK_ID | LAST_ACTIVITY_YEAR | ACT_LAST_2YEAR | ACTIVE_L30 |
---|---|---|---|---|---|
Musk1 | 123 | 123 | 2,009 | FALSE | FALSE |
Musk2 | 345 | 123 | 2,009 | FALSE | TRUE |
Musk2 | 345 | 345 | 2,022 | TRUE | TRUE |
Zuckerberg1 | 678 | 678 | 2,021 | TRUE | FALSE |
Zuckerberg2 | 910 | 678 | 2,021 | TRUE | TRUE |
Zuckerberg2 | 910 | 910 | 2,022 | TRUE | TRUE |
So now we can add the WHERE clause:
where act_last_2year and ac.active_l30
which gives:
ACCOUNT_NAME | ACCOUNT_ID | LINK_ID | LAST_ACTIVITY_YEAR | ACT_LAST_2YEAR | ACTIVE_L30 |
---|---|---|---|---|---|
Musk2 | 345 | 345 | 2,022 | TRUE | TRUE |
Zuckerberg2 | 910 | 678 | 2,021 | TRUE | TRUE |
Zuckerberg2 | 910 | 910 | 2,022 | TRUE | TRUE |
so now we can see that both Zuckerberg2 accounts (910 & 678) where active in the prior two years. So this can be fixed with a QUALIFY ROW_NUMBER to pick the newest.
qualify row_number() over (partition by h.account_name order by link_id desc) = 1
ACCOUNT_NAME | ACCOUNT_ID | LINK_ID | LAST_ACTIVITY_YEAR | ACT_LAST_2YEAR | ACTIVE_L30 |
---|---|---|---|---|---|
Musk2 | 345 | 345 | 2,022 | TRUE | TRUE |
Zuckerberg2 | 910 | 910 | 2,022 | TRUE | TRUE |
I would say "Excellent, now to clean this all up." but looking back at the answer I gave, I have just thrown away the one thing you wanted to keep.
so lets flip all how I would answer you question on it's head, and just go fetch the values you seem to want:
select a.*
from account_table as a
join userid_table as ou
on a.old_accountID = ou.account_id
join userid_table as nu
on a.new_accountID = nu.account_id
join activity_table as oact
on ou.account_name = oact.account_name
and last_activity_year > 2020
join activity_table as nact
on nu.account_name = nact.account_name
and nact.active_l30;
OLD_ACCOUNTID | NEW_ACCOUNTID |
---|---|
678 | 910 |
Okay, that could have been a lot shorter..
Upvotes: 1