Reputation: 75
I have an example string data as below.
+----------------------------+
| Items |
|----------------------------|
| A >> B >> C |
| A >> A >> B |
| B >> C >> B >> B >> A |
| B >> B >> C >> C >> A |
| B >> A >> B >> C >> A >> D |
+----------------------------+
Take the "Items" in the first row as an example, this "Items" contains three elements, namely A, B and C. I want to calculate the number of each element in order. Besides, I only focus on 3 elements so I ignore 'C >> A >> D' in the final "Items".
Here's what I want:
+----------------------------+---------+----------+---------+----------+---------+----------+
| Items | Item1 | I1_Count | Item2 | I2_Count | Item3 | I3_Count |
|----------------------------|---------|----------|---------|----------|---------|----------|
| A >> B >> C | A | 1 | B | 1 | C | 1 |
| A >> A >> B | A | 2 | B | 1 | NULL | NULL |
| B >> C >> B >> B >> A | B | 1 | C | 1 | B | 2 |
| B >> B >> C >> C >> A | B | 2 | C | 2 | A | 1 |
| B >> A >> B >> C >> A >> D | B | 1 | A | 1 | B | 1 |
+----------------------------+---------+----------+---------+----------+---------+----------+
How can I implement it in SQL(I use Google Bigquery)?
Thanks a million!
Upvotes: 0
Views: 309
Reputation: 172993
Below is for the less likely case when your A, B, C, D are actually finite set of letters or words that you know in advance (BigQuery Standard SQL):
#standardSQL
select items,
left(arr[offset(0)], 1) as Item1,
length(arr[offset(0)]) as I1_Count,
left(arr[safe_offset(1)], 1) as Item2,
length(arr[safe_offset(1)]) as I2_Count,
left(arr[safe_offset(2)], 1) as Item3,
length(arr[safe_offset(2)]) as I3_Count
from `project.dataset.table`,
unnest([struct(regexp_extract_all(replace(Items, ' >> ', ''), r'(A+|B+|C+|D+)') as arr)])
if applied to sample data from your question - output is
Moreover, just adding more lines will cover more sequences
#standardSQL
select items,
left(arr[offset(0)], 1) as Item1,
length(arr[offset(0)]) as I1_Count,
left(arr[safe_offset(1)], 1) as Item2,
length(arr[safe_offset(1)]) as I2_Count,
left(arr[safe_offset(2)], 1) as Item3,
length(arr[safe_offset(2)]) as I3_Count,
left(arr[safe_offset(3)], 1) as Item4,
length(arr[safe_offset(3)]) as I4_Count,
left(arr[safe_offset(4)], 1) as Item5,
length(arr[safe_offset(4)]) as I5_Count,
left(arr[safe_offset(5)], 1) as Item6,
length(arr[safe_offset(5)]) as I6_Count
from `project.dataset.table`,
unnest([struct(regexp_extract_all(replace(Items, ' >> ', ''), r'(A+|B+|C+|D+)') as arr)])
with output
Upvotes: 1
Reputation: 172993
Below is for BigQuery Standard SQL (and is just simple adjustment of the solution for your previous question)
#standardSQL
select items,
arr[offset(0)].item as Item1,
arr[offset(0)].cnt as I1_Count,
arr[safe_offset(1)].item as Item2,
arr[safe_offset(1)].cnt as I2_Count,
arr[safe_offset(2)].item as Item3,
arr[safe_offset(2)].cnt as I3_Count
from (
select items,
array(select as struct item, count(1) cnt
from unnest(
array(select as struct item, countif(not flag) over(order by offset) grp
from unnest(array(
select as struct item, offset, item = lag(item) over(order by offset) flag
from unnest(split(Items, ' >> ')) item with offset
)))
)
group by item, grp
order by grp
) arr
from `project.dataset.table`
)
if to apply to sample data from your question - output is
Upvotes: 1