Reputation: 135
I am working with a messy source dataset that I want to make an nice, easy repeatable classification system on to make analysis easier for me (and easier to share with my team). My source table has seven columns that I need to evaluate and my lookup table has multiple values I'll use to classify when there is a match. My source table has millions of rows a day. I'm able to accomplish my classification using a CASE WHEN but it's getting too big for a UDF now so my goal is to store the lookup information in a table and then run a LEFT JOIN to classify the data. Not all data will be classified, so I need a (no match) for the values that don't match. I tried doing a LEFT JOIN with REGEXP_CONTAINS in the ON and it is taking forever and sometimes just doesn't process.
My question is how can I make a replicable lookup that can expand and is centralized (so I'm not copying a CASE WHEN from a text file or something).
My source file looks like:
ID | category | action | label | eventName | screenTitle | pageTitle | pagePlatform
-------------------------------------------------------------------------------------------
1 | NULL | click | button | buttonClick | NULL | NULL | CMS
2 | purchase | NULL | tree | buttonClick | NULL | tree | CMS
3 | NULL | NULL | NULL | screen_view | water | NULL | NULL
My CASE WHENs are something like:
WHEN eventName = 'screen_view' AND screenTitle = 'water' THEN 'Water Screen'
WHEN category = 'purchase' THEN 'Purchase'
When I tried a lookup table I had all the columns from my source table in the lookup and put . where the value didn't matter.
Would love any ideas so I can keep a centralized spot for the logic.
Upvotes: 0
Views: 367
Reputation: 173106
Consider below approach (BigQuery Standard SQL)
with lookup_table as (
select '%"eventName":"screen_view"%,%"screenTitle":"water"%' as conditions, 'Water Screen' result union all
select '%"category":"purchase"%', 'Purchase'
)
select any_value(t).*, max(result) as result
from (
select t,
if(to_json_string(t) like all (conditions), result, null) as result
from your_table t, lookup_table
)
group by to_json_string(t)
if applied to sample data in your question - output is
Upvotes: 2