Jonas Palačionis
Jonas Palačionis

Reputation: 4842

Sum of distinct values after grouping explodes a metric

I am using

with t1 as 
(
SELECT
    DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
    fullVisitorId,
    product.productSKU,
    product.v2ProductName,
    case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
    count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
    count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
    count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
    count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr, 
    UNNEST(hits) hits,
    UNNEST(product) product
GROUP BY 1,2,3,4,5
)
select 
    month,
    sum(views_pdp) as pdp 
    ,sum(add_cart) as add_cart
    ,sum(conversions) as conversions
    ,sum(transaction_id_cnt)
from t1
group by 1
order by 1 desc;

Which returns

month       pdp     add_cart    conversions     f0_ 
2021-02-01  500     100         20              10  
2021-01-01  600     200         30              20

I know that f0_ ( count(distinct(hits.transaction.transactionId)) ) is bad here because of product.productSKU and product.v2ProductName grouping. In general, when user makes an order with 3 items in his basket, I want to count this as one order, whereas now it is counted as 3.

This count(distinct(hits.transaction.transactionId)) as transaction_id_cnt results in the correct output if I comment out product.productSKU and product.v2ProductName.

Running this query:

with t1 as 
(
SELECT
    DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
    fullVisitorId,
    -- product.productSKU, # commented out
    -- product.v2ProductName, # commented out
    case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
    count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
    count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
    count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
    count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr, 
    UNNEST(hits) hits,
    UNNEST(product) product
GROUP BY 1,2,3,4,5
)
select 
    month,
    sum(views_pdp) as pdp 
    ,sum(add_cart) as add_cart
    ,sum(conversions) as conversions
    ,sum(transaction_id_cnt)
from t1
group by 1
order by 1 desc;

Returns what is expected, but now I don't have productSKU and v2ProductName which I need. I suspect that the problem is that each order is a new line in google big query and when I ask to to select it by product name and SKU, I count the uniques and then sum it.

How can I achieve the correct summation of count(distinct(hits.transaction.transactionId)) without losing the grouping by product.productSKU and product.v2ProductName which explodes this metric?

Upvotes: 0

Views: 442

Answers (1)

Daniel Aranda
Daniel Aranda

Reputation: 6552

On the group by Query you could cherry pick them as array(so you don't group by them):

  ARRAY_AGG(DISTINCT product.productSKU IGNORE NULLS) AS productSKU_list,
  ARRAY_AGG(DISTINCT product.v2ProductName IGNORE NULLS) AS productName_list,

Update per your below comment: If you want to use them in further group by just save them as string instead of array.

  STRING_AGG(DISTINCT product.productSKU, ',') AS productSKU_list,
  STRING_AGG(DISTINCT product.v2ProductName, ',') AS productName_list,

Upvotes: 1

Related Questions