Jonas Palačionis
Jonas Palačionis

Reputation: 4842

Getting sum of distinct values when grouping by an unnested value in google big query

I am querying google big query table which has many rows but the ones I am interested in looks like this:

date             fullVisitorId       hits.product.productSKU     hits.product.v2ProductName     hits.transaction.transactionId

20210427        63546815            MM52AF                      panda                           149816182
20210427        65198162            KGSA5A                      giraffe                         321498182

I am trying to count the total transactions by counting distinct hits.transaction.transactionId.

with t1 as 
(
SELECT
    DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
    fullVisitorId,
    product.productSKU as sku,
    product.v2ProductName as v2,
    case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
    count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
    count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
    count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
    count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr, 
    UNNEST(hits) hits,
    UNNEST(product) product
GROUP BY 1,2,3,4,5
)
select 
    month,
    product.productSKU as sku,
    product.v2ProductName as v2,
    sum(views_pdp) as pdp 
    ,sum(add_cart) as add_cart
    ,sum(conversions) as conversions
    ,sum(transaction_id_cnt) as transactions
from t1
group by 1
order by 1 desc;

Which returns:

month               sku          v2      pdp            add_cart    conversions     transactions    
2021-04-01          AHBS         615     10146410       365569      46885           46640
2021-03-01          HERD         154     10074095       399483      58162           57811

But transactions is not correct, I get the correct output using this:


with t1 as 
(
SELECT
    DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
    fullVisitorId,
    ARRAY_AGG(DISTINCT product.productSKU IGNORE NULLS) AS productSKU_list, -- changed this
    ARRAY_AGG(DISTINCT product.v2ProductName IGNORE NULLS) AS productName_list, -- changed this
    case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
    0 AS views_impressions,
    count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
    count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
    0 AS add_shortlist,
    count(case when hits.ecommerceaction.action_type = '5' then fullvisitorid else null end) AS checkouts,
    count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
    count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr, 
    UNNEST(hits) hits,
    UNNEST(product) product
GROUP BY 1,2,5
)
select 
    month,
    sum(views_pdp) as pdp 
    ,sum(add_cart) as add_cart
    ,sum(conversions) as conversions
    ,sum(transaction_id_cnt) as transactions
from t1
group by 1
order by 1 desc;

Which returns the correct transactions

month       pdp         add_cart     conversions      transactions  
2021-04-01  9978511     396333       46885            30917 
2021-03-01  15101718    568904       58162            23017

But using this :

...
ARRAY_AGG(DISTINCT product.productSKU IGNORE NULLS) AS productSKU_list, 
ARRAY_AGG(DISTINCT product.v2ProductName IGNORE NULLS) AS productName_list,
...

Does not allow me to group or select productSKU_list and productName_list in my second select statement.

I believe this is because if one order is made with multiple items in the basket there are multiple lines in google big query with the same hits.transaction.transactionId I tried confirming this with:

select distinct(hits.transaction.transactionId), count(distinct hits.transaction.transactionId) as total
FROM `table` AS nr, 
    UNNEST(hits) hits,
    UNNEST(product) product
WHERE _TABLE_SUFFIX between '200101' AND '210428'  
GROUP BY 1
order by 2 desc

But I get:

transactionId   total   
ABSAD54         1   
515ABDG         1

So at this point, I am lost, as I am unsure why I get the correct answer if I use the second script or when I comment out this part from the first query.

 --product.productSKU,
 --product.v2ProductName,

Any tips on how google big query works is accepted.

My goal is to have the correct output of transactions which is achieved in the second script but still be able to group and have values of product.productSKU and product.v2ProductName.

Upvotes: 0

Views: 854

Answers (1)

eshirvana
eshirvana

Reputation: 24603

in your second query , you need to aggregate them again :

select 
    month,
    sum(views_pdp) as pdp 
    ,sum(add_cart) as add_cart
    ,sum(conversions) as conversions
    ,sum(transaction_id_cnt) as transactions
    ,ARRAY_AGG(productSKU_list)
    ,ARRAY_AGG(productName_list)
from t1
group by month
order by month desc;

Upvotes: 0

Related Questions