Hamilton
Hamilton

Reputation: 678

Any efficient way to pivot table and count distinct date in sql / bigquery?

I have transaction data table which is in user level, where user buy different products day to day. I want to aggregate the total transaction count for per user by pivoting the columns with same prefix. meaning, per user, total time of transaction for pivoted columns. it should be conditioned on date as well.

reproducible data and attempt

here is reproducible data schema:

CREATE OR REPLACE TABLE transaction_tble (
  user_id char(10),
  transaction_date, date,
  catg_prod1_amt int(10),
  catg_prod2_amt int(10),
  catg_prod3_amt int(10),
  sku_item1_qty int(10),
  sku_item2_qty int(10),
  sku_item3_qty int(10)
)

here is the value of the table:

here is the sample data:

user_id,transaction_date,sku_sale_prod1,sku_sale_prod2,sku_sale_prod3,catg_purch_prod3,catg_purch_prod4,catg_purch_prod5
A,1/14/2022,10,0,0,0,11,0
B,1/10/2022,0,18,0,5,11,7
A,1/19/2022,6,18,2,5,0,0
A,1/19/2022,10,18,1,5,11,7
B,1/19/2022,10,18,1,5,11,7
C,1/20/2022,10,18,1,5,11,7
C,1/20/2022,10,18,1,5,11,7
B,4/19/2022,10,18,1,5,11,7
A,5/23/2022,10,18,1,5,11,7
A,7/23/2022,10,18,1,5,11,7

my attempt

I think using pivot operation may be right approach but it doesn't work

CREATE OR REPLACE TABLE trans_view (
    select 
        user_id,
        ( case when transaction_date between '2022-01-01' - INTERVAL 6 MONTH then count(distinct transaction_date) end) as 2022_h1_trans_cnt, 
        ( case when transaction_date between '2022-06-01' - INTERVAL 6 MONTH then count(distinct transaction_date) end) as 2022_h2_trans_cnt,

        -- using pivot here
        ( case 
            when transaction_date between '2022-01-01' - INTERVAL 6 MONTH
            and 
            PIVOT (
                count(distinct transaction_date)
                for sku in (
                    'sku_sale_prod1','sku_sale_prod2','sku_sale_prod3'
                )
            )
            then count(distinct transaction_date) 
            end
        ) as 2022_h1_trans_cnt_sku, 
        -- 
        ( case 
            when transaction_date between '2022-01-01' - INTERVAL 6 MONTH
            and 
            PIVOT (
                count(distinct transaction_date)
                for catg in (
                    'catg_purch_prod3','catg_purch_prod4','catg_purch_prod5'
                )
            )
            then count(distinct transaction_date) 
            end
        ) as 2022_h1_trans_cnt_catg
    from transaction_table
    group by user_id 
)

I looked into pivot function in big query, not sure what went wrong with my logic. Can anyone suggest possible workaround to do this?

desired output

here is my desired output table schema:

CREATE OR REPLACE VIEW transc_view as(
select
    user_id,
    2022_h1_trans_cnt,
    2022_h2_trans_cnt,
    2022_h1_trans_cnt_catg,
    2022_h2_trans_cnt_catg,
    2022_h1_trans_cnt_sku,
    2022_h2_trans_cnt_sku
from transaction_tble
group by user_id
)

here is desired value of the output table:

user_id,2022_1st_6mth_trns_cnt,2022_2nd_6mth_trns_cnt,2022_1st_6mth_trns_cnt_sku,2022_2nd_6mth_trns_cnt_sku,2022_1st_6mth_trns_cnt_catg,2022_2nd_6mth_trns_cnt_catg
A,4,1,3,,3,1
B,3,0,3,0,3,0
C,1,1,1,0,1,0

How can I correctly pivot the table and get distinct transaction date count for this? any thoughts or quick workaround on this?

Upvotes: 1

Views: 581

Answers (1)

Jaytiger
Jaytiger

Reputation: 12274

Except 2022_h1_trans_cnt and 2022_h2_trans_cnt in your expected outout, you can consider below.

WITH stacked AS (
  SELECT *, 
        'y' || EXTRACT(YEAR FROM tx_date) || '_' || 
        CASE 
          WHEN EXTRACT(QUARTER FROM tx_date) IN (1, 2) THEN 'h1' -- 1st, 2nd Quarter 
          WHEN EXTRACT(QUARTER FROM tx_date) IN (3, 4) THEN 'h2' -- 3rd, 4th Quarter
        END AS period_name,
        CASE 
          WHEN STARTS_WITH(metric, 'sku_sale_prod') THEN 'trns_cnt_sku' 
          WHEN STARTS_WITH(metric, 'catg_purch_prod') THEN 'trns_cnt_catg'
        END AS metric_name,
    FROM (
      SELECT *, PARSE_DATE('%m/%d/%Y', transaction_date) tx_date
        FROM transaction_table 
     UNPIVOT (value FOR metric IN (
                      catg_purch_prod3, catg_purch_prod4, catg_purch_prod5,
                      sku_sale_prod1, sku_sale_prod2, sku_sale_prod3
                    )
             )
    )
),
aggregation AS (
  SELECT user_id, period_name || '_' || metric_name AS tx_period_metric,
         COUNT(DISTINCT tx_date) cnt
    FROM stacked
   GROUP BY 1, 2
)
SELECT user_id,
       y2022_h1_trns_cnt_sku AS y2022_h1_trns_cnt,
       y2022_h2_trns_cnt_sku AS y2022_h2_trans_cnt,
       * EXCEPT(user_id)
  FROM aggregation
 PIVOT (SUM(cnt) FOR tx_period_metric IN (
                   'y2022_h1_trns_cnt_sku', 'y2022_h2_trns_cnt_sku',
                   'y2022_h1_trns_cnt_catg', 'y2022_h2_trns_cnt_catg'
                 )
       )
;

Query results

enter image description here

  • It looks 2022_h1_trans_cnt and 2022_h2_trans_cnt are same value as that of y2022_h1_trns_cnt_sku, y2022_h2_trns_cnt_sku respectively. Correct me if I'm wrong.

Upvotes: 2

Related Questions