JAB
JAB

Reputation: 135

Creating a BigQuery lookup table with REGEXP_CONTAINS in the LEFT JOIN On

I am working with a messy source dataset that I want to make an nice, easy repeatable classification system on to make analysis easier for me (and easier to share with my team). My source table has seven columns that I need to evaluate and my lookup table has multiple values I'll use to classify when there is a match. My source table has millions of rows a day. I'm able to accomplish my classification using a CASE WHEN but it's getting too big for a UDF now so my goal is to store the lookup information in a table and then run a LEFT JOIN to classify the data. Not all data will be classified, so I need a (no match) for the values that don't match. I tried doing a LEFT JOIN with REGEXP_CONTAINS in the ON and it is taking forever and sometimes just doesn't process.

My question is how can I make a replicable lookup that can expand and is centralized (so I'm not copying a CASE WHEN from a text file or something).

My source file looks like:

    ID | category | action | label  | eventName | screenTitle | pageTitle | pagePlatform 
-------------------------------------------------------------------------------------------
     1 |  NULL     | click  | button | buttonClick | NULL      | NULL      | CMS
     2 |  purchase | NULL   | tree   | buttonClick | NULL      | tree      | CMS
     3 |  NULL     | NULL   | NULL   | screen_view | water     | NULL      | NULL

My CASE WHENs are something like:

WHEN eventName = 'screen_view' AND screenTitle = 'water' THEN 'Water Screen'
WHEN category = 'purchase' THEN 'Purchase'

When I tried a lookup table I had all the columns from my source table in the lookup and put . where the value didn't matter.

Would love any ideas so I can keep a centralized spot for the logic.

Upvotes: 0

Views: 367

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173106

Consider below approach (BigQuery Standard SQL)

with lookup_table as (
  select '%"eventName":"screen_view"%,%"screenTitle":"water"%' as conditions, 'Water Screen' result union all
  select  '%"category":"purchase"%', 'Purchase'
)
select any_value(t).*, max(result) as result 
from (
  select t,
    if(to_json_string(t) like all (conditions), result, null) as result
  from your_table t, lookup_table 
)
group by to_json_string(t)    

if applied to sample data in your question - output is

enter image description here

Upvotes: 2

Related Questions