Abhishek Rajeevan
Abhishek Rajeevan

Reputation: 43

Filtering the records with a left-join in BigQuery

I have two tables in Bigquery - raw1_data and mapping_data. The intention is to use the mapping_data table to map the products from the raw1_data table to produce the final_data table.

mapping_data

dataset country category brand new_category new_brand
raw1 A abc pqr bsq pqr
raw1 B xyz efg
raw1 X abc lmn bsq lmn
raw1 Y xyz hij xyz hij
raw2 X abc lmn abc lmn
raw3 B xyz efg xyz efg

raw_data

country category brand
A abc efg
B xyz efg
X abc hij
X abc lmn
Y xyz hij

final_data

country category brand status
A abc efg Unmapped
X abc hij Unmapped
X bsq lmn Mapped
Y xyz hij Mapped

The rules of mapping are -

  1. Match the country, category, brand columns from raw1_data to country,category,brand columns in mapping_data. Also, we need to match only to the rows where dataset column is "raw1" because mapping_data table is used by different datasets and has mapping criteria for different datasets.
  2. If the same country,category,brand is found in mapping_data table, then category and brand for the final_data table would be new_category and new_brand from the mapping_data table. The Status would be marked as 'Mapped'.
  3. If there is no match found in mapping_data table, the category and brand of the final_data table would remian the same as in raw1_data table. But Status would be 'Unmapped'.
  4. In case after matching the new_category and new_brand is NULL as seen for B,xyz,efg combination, then remove such rows from the final_data table.

I have thought of several approaches and finally arrived at the code below. Is this the best possible solution or is there any other way I can achieve this without having to use a subquery and case statements? The number of records in the raw1_data table is in millions while the number of records in the mapping_data table is a few hundred or thousands.

Code

with mapping_cte as (
    SELECT
        rd.country,
        rd.category,
        rd.brand,
        md.dataset,
        md.country as country1,
        md.category as category1,
        md.brand as brand1,
        md.new_category,
        md.new_brand,
        CASE
            WHEN md.country IS NULL THEN STRUCT(
                rd.category AS mapped_category,
                rd.brand AS mapped_brand,
                'Unmapped' AS status
            )
            WHEN md.country IS NOT NULL
            AND (
                new_category IS NULL
                OR new_brand IS NULL
            ) THEN STRUCT(
                NULL AS mapped_category,
                NULL AS mapped_brand,
                'remove' as status
            )
            ELSE STRUCT(
                new_category AS mapped_category,
                new_brand AS mapped_brand,
                'Mapped' AS status
            )
        END as fields
    FROM
        raw1_data rd
        LEFT OUTER JOIN mapping_data md ON rd.country = md.country
        AND rd.category = md.category
        AND rd.brand = md.brand
        AND md.dataset = 'raw1'
)
select
    mapping_cte.country,
    mapping_cte.fields.mapped_category,
    mapping_cte.fields.mapped_brand,
    mapping_cte.fields.status
from
    mapping_cte
where
    mapping_cte.fields.status != 'remove'
order by
    1,
    2,
    3

Upvotes: 1

Views: 623

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172984

Consider below approach

select country, 
  (case status_flag
    when true then struct(new_category as category, new_brand as brand, 'Mapped' as status)
    else struct(category, brand, 'Unmapped' as status)
  end).*
from (
  select country, category, brand, new_category, new_brand, 
    not m.country is null as status_flag
  from raw_data r
  left join (select * from mapping_data where dataset = 'raw1') m
  using(country, category, brand) 
)
where not (status_flag and new_category is null and new_brand is null)        

if applied to sample data in your question - output is

enter image description here

Upvotes: 2

Related Questions