Reputation: 43
I have two tables in Bigquery - raw1_data and mapping_data. The intention is to use the mapping_data table to map the products from the raw1_data table to produce the final_data table.
mapping_data
dataset | country | category | brand | new_category | new_brand |
---|---|---|---|---|---|
raw1 | A | abc | pqr | bsq | pqr |
raw1 | B | xyz | efg | ||
raw1 | X | abc | lmn | bsq | lmn |
raw1 | Y | xyz | hij | xyz | hij |
raw2 | X | abc | lmn | abc | lmn |
raw3 | B | xyz | efg | xyz | efg |
raw_data
country | category | brand |
---|---|---|
A | abc | efg |
B | xyz | efg |
X | abc | hij |
X | abc | lmn |
Y | xyz | hij |
final_data
country | category | brand | status |
---|---|---|---|
A | abc | efg | Unmapped |
X | abc | hij | Unmapped |
X | bsq | lmn | Mapped |
Y | xyz | hij | Mapped |
The rules of mapping are -
I have thought of several approaches and finally arrived at the code below. Is this the best possible solution or is there any other way I can achieve this without having to use a subquery and case statements? The number of records in the raw1_data table is in millions while the number of records in the mapping_data table is a few hundred or thousands.
Code
with mapping_cte as (
SELECT
rd.country,
rd.category,
rd.brand,
md.dataset,
md.country as country1,
md.category as category1,
md.brand as brand1,
md.new_category,
md.new_brand,
CASE
WHEN md.country IS NULL THEN STRUCT(
rd.category AS mapped_category,
rd.brand AS mapped_brand,
'Unmapped' AS status
)
WHEN md.country IS NOT NULL
AND (
new_category IS NULL
OR new_brand IS NULL
) THEN STRUCT(
NULL AS mapped_category,
NULL AS mapped_brand,
'remove' as status
)
ELSE STRUCT(
new_category AS mapped_category,
new_brand AS mapped_brand,
'Mapped' AS status
)
END as fields
FROM
raw1_data rd
LEFT OUTER JOIN mapping_data md ON rd.country = md.country
AND rd.category = md.category
AND rd.brand = md.brand
AND md.dataset = 'raw1'
)
select
mapping_cte.country,
mapping_cte.fields.mapped_category,
mapping_cte.fields.mapped_brand,
mapping_cte.fields.status
from
mapping_cte
where
mapping_cte.fields.status != 'remove'
order by
1,
2,
3
Upvotes: 1
Views: 623
Reputation: 172984
Consider below approach
select country,
(case status_flag
when true then struct(new_category as category, new_brand as brand, 'Mapped' as status)
else struct(category, brand, 'Unmapped' as status)
end).*
from (
select country, category, brand, new_category, new_brand,
not m.country is null as status_flag
from raw_data r
left join (select * from mapping_data where dataset = 'raw1') m
using(country, category, brand)
)
where not (status_flag and new_category is null and new_brand is null)
if applied to sample data in your question - output is
Upvotes: 2