Reputation: 314
I have a big-query schema such as this:
visitorId INTEGER NULLABLE
visitID INTEGER NULLABLE
hits RECORD REPEATED
hits.eventInfo RECORD NULLABLE
hits.eventInfo.eventCategory STRING NULLABLE
hits.eventInfo.eventLabel STRING NULLABLE
with sample data as:
visitorId visitId hits.eventInfo.eventCategory hits.eventInfo.eventCategory
123456 1 abc {"info":"secret", "otherfields":"blah"}
lmn {"info":"secret", "otherfields":"blah"}
xyz {"info":"secret", "otherfields":"blah"}
124557 1 abc {"info":"secret", "otherfields":"blah"}
lmn {"info":"secret", "otherfields":"blah"}
xyz {"info":"secret", "otherfields":"blah"}
I need to remove "info":"secret", only when the eventCategory is "abc".
I am a big-query newbie. After much hitting and trying I was able to come to this, but unfortunately stuck now.
UPDATE `project.dataset.ga_sessions_20200608`
SET hits = ARRAY(
SELECT AS STRUCT * REPLACE((REGEXP_REPLACE(eventInfo.eventLabel, r"\"info\":\"[a-z A-Z]*\",", "")) AS eventInfo.eventLabel) from UNNEST(hits)
)
WHERE (select eventInfo.eventLabel from UNNEST(hits)) LIKE '%info%'
There are two problems here.
Any help, pointers will be appreciated.
Upvotes: 2
Views: 3466
Reputation: 10257
It's challenging to do this with DML approach (UPDATE
) as opposed to a batch process that operates over the whole table for a few reasons (some of which you already gave):
This puts you in the position of having to reconstruct the entire struct of both hits
and eventInfo
in order to replace it's one field (which is itself a JSON encoded struct, although BigQuery is sort of blind to this and treats it as a string).
AFAIK BigQuery does not have a function that marshalls a true STRUCT from json, so regex might be the only way to actually eliminate the undesired field in the json-encoded column.
The WHERE condition here can make use of the JSON_EXTRACT_SCALAR
standardSQL function to capture the condition that the particular key to be removed actually exists in the json.
So a possible pure BigQuery approach (no UDFs) could look like:
#standardSQL
UPDATE
`project.dataset.table_DATE`
SET
--reconstruct hits column as an array
hits = ARRAY(
SELECT
--reconstruct each struct of hits
AS STRUCT * REPLACE( (
SELECT
--reconstruct eventInfo...
AS STRUCT eventInfo.* REPLACE(
--with eventLabel replaced with a filtered version
REGEXP_REPLACE(eventInfo.eventLabel, r"\"secret\":\"[A-Za-z\p{L}]* [A-Za-z\p{L}]*\",", "") AS eventLabel)
) AS eventInfo )
FROM
UNNEST(hits) AS hits )
WHERE
--Only for rows where at least one eventLabel's json contains a `secret` key
EXISTS (SELECT JSON_EXTRACT_SCALAR(h.eventInfo.eventLabel, "$.secret") is not null from unnest(hits) as h)
Note the regex there may need some changes based on the format of the data. Here I'm assuming a two word secret value with possible unicode characters (like a first + last name).
The regex part might be better served by a UDF, although those can be slow for large table.
Upvotes: 2