JohnsonLee
JohnsonLee

Reputation: 75

Big query SQL - Count elements from a string but in order

I have an example string data as below.

+----------------------------+
|          Items             |
|----------------------------|
| A >> B >> C                |
| A >> A >> B                |
| B >> C >> B >> B >> A      |
| B >> B >> C >> C >> A      |
| B >> A >> B >> C >> A >> D |
+----------------------------+

Take the "Items" in the first row as an example, this "Items" contains three elements, namely A, B and C. I want to calculate the number of each element in order. Besides, I only focus on 3 elements so I ignore 'C >> A >> D' in the final "Items".

Here's what I want:

+----------------------------+---------+----------+---------+----------+---------+----------+
|          Items             |  Item1  | I1_Count |  Item2  | I2_Count |  Item3  | I3_Count |
|----------------------------|---------|----------|---------|----------|---------|----------|
| A >> B >> C                |    A    |     1    |    B    |     1    |    C    |     1    | 
| A >> A >> B                |    A    |     2    |    B    |     1    |   NULL  |    NULL  |
| B >> C >> B >> B >> A      |    B    |     1    |    C    |     1    |    B    |     2    |
| B >> B >> C >> C >> A      |    B    |     2    |    C    |     2    |    A    |     1    |
| B >> A >> B >> C >> A >> D |    B    |     1    |    A    |     1    |    B    |     1    |
+----------------------------+---------+----------+---------+----------+---------+----------+

How can I implement it in SQL(I use Google Bigquery)?

Thanks a million!

Upvotes: 0

Views: 309

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for the less likely case when your A, B, C, D are actually finite set of letters or words that you know in advance (BigQuery Standard SQL):

#standardSQL
select items,   
  left(arr[offset(0)], 1) as Item1,
  length(arr[offset(0)]) as I1_Count,
  left(arr[safe_offset(1)], 1) as Item2,
  length(arr[safe_offset(1)]) as I2_Count,
  left(arr[safe_offset(2)], 1) as Item3,
  length(arr[safe_offset(2)]) as I3_Count
from `project.dataset.table`,
unnest([struct(regexp_extract_all(replace(Items, ' >> ', ''), r'(A+|B+|C+|D+)') as arr)])     

if applied to sample data from your question - output is

enter image description here

Moreover, just adding more lines will cover more sequences

#standardSQL
select items,   
  left(arr[offset(0)], 1) as Item1,
  length(arr[offset(0)]) as I1_Count,
  left(arr[safe_offset(1)], 1) as Item2,
  length(arr[safe_offset(1)]) as I2_Count,
  left(arr[safe_offset(2)], 1) as Item3,
  length(arr[safe_offset(2)]) as I3_Count,
  left(arr[safe_offset(3)], 1) as Item4,
  length(arr[safe_offset(3)]) as I4_Count,
  left(arr[safe_offset(4)], 1) as Item5,
  length(arr[safe_offset(4)]) as I5_Count,
  left(arr[safe_offset(5)], 1) as Item6,
  length(arr[safe_offset(5)]) as I6_Count
from `project.dataset.table`,
unnest([struct(regexp_extract_all(replace(Items, ' >> ', ''), r'(A+|B+|C+|D+)') as arr)])             

with output

enter image description here

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL (and is just simple adjustment of the solution for your previous question)

#standardSQL
select items,
  arr[offset(0)].item as Item1,
  arr[offset(0)].cnt as I1_Count,
  arr[safe_offset(1)].item as Item2,
  arr[safe_offset(1)].cnt as I2_Count,
  arr[safe_offset(2)].item as Item3,
  arr[safe_offset(2)].cnt as I3_Count
from (
  select items, 
    array(select as struct item, count(1) cnt
    from unnest(
    array(select as struct item, countif(not flag) over(order by offset) grp
    from unnest(array(
        select as struct item, offset, item = lag(item) over(order by offset) flag
        from unnest(split(Items, ' >> ')) item with offset
      )))
    ) 
    group by item, grp
    order by grp
    ) arr  
  from `project.dataset.table`
)    

if to apply to sample data from your question - output is

enter image description here

Upvotes: 1

Related Questions