Reputation: 180
I'm streaming Stackdriver logs into Bigquery, and they end up in a textPayload
field in the following format:
member_id_hashed=123456789,
member_age -> Float(37.0,244),
operations=[92967,93486,86220,92814,92943,93279,...],
scores=[3.214899,2.3641025E-5,2.5823574,2.3818345,3.9919448,0.0,...],
[etc]
I then define a query/view on the table with the raw logging entries as follows:
SELECT
member_id_hashed as member_id, member_age,
split(operations,',') as operation,
split(scores,',') as score
FROM
(
SELECT
REGEXP_EXTRACT(textPayload, r'member_id=([0-9]+)') as member_id_hashed,
REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
from `myproject.mydataset.mytable`
)
resulting in one row with two single fields and two arrays:
Ideally, for further analysis, I would like the two arrays to be nested (e.g. operation.id and operation.score) or flatten the arrays line by line while keeping the positions (i.e. line 1 of array 1 should appear next to line 1 of array 2, etc):
Can anybody point me to a way to make nested fields out of the arrays, or to flatten the arrays? I tried unnesting and joining, but that would give me too many possible cross-combinations in the result.
Thanks for your help!
Upvotes: 2
Views: 394
Reputation: 33705
You can zip the two arrays like this. It unnests the array with operation IDs and gets the index of each element, then selects the corresponding element of the array with scores. Note that this assumes that the arrays have the same number of elements. If they don't, you could use SAFE_OFFSET
instead of OFFSET
in order to get NULL if there are more IDs than scores, for instance.
SELECT
member_id_hashed, member_age,
ARRAY(
SELECT AS STRUCT id, split(scores,',')[OFFSET(off)] AS score
FROM UNNEST(split(operations,',')) AS id WITH OFFSET off
ORDER BY off
) AS operations
FROM (
SELECT
REGEXP_EXTRACT(textPayload, r'member_id=([0-9]+)') as member_id,
REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
from `myproject.mydataset.mytable`
)
Upvotes: 2