Reputation: 1325
I have a table in bigquery where I have object and for each object I have some stringified json. In json, an example row look like below:
{
"ObjectID": "1984931229",
"indexed_abstract": "{\"IndexLength\":123,\"InvertedIndex\":{\"Twenty-seven\":[0],\"metastatic\":[1,45],\"breast\":[2],\"adenocarcinoma\":[3],\"patients,\":[4]}}"
}
where inside the indexed_abstract
we have an InvertedIndex
which contains some keywords and how many times these keywords appeared in the ObjectID
.
Now I want to access the stringified json by parsing the json using bigquery and for each ObjectID
I want to create a nested field where I have the keyword, the corresponding array and the length of the corresponding array.
For example in this case the output would look like the following:
+------------+----------------+---------------+-------------------+
| ObjectID | keyword.key | keyword.count | keyword.positions |
+------------+----------------+---------------+-------------------+
| 1984931229 | Twenty-seven | 1 | [0] |
| | metastatic | 2 | [1,45] |
| | breast | 1 | [2] |
| | adenocarcinoma | 1 | [3] |
| | patients | 1 | [4] |
+------------+----------------+---------------+-------------------+
I understand I could use JSON_EXTRACT function but I am not sure what would be my key inside the inverted index to access the keywords and the arrays corresponding to them.
Upvotes: 3
Views: 6468
Reputation: 172944
Below is for BigQuery Standard SQL
#standardSQL
SELECT ObjectID,
ARRAY(
SELECT AS STRUCT
key,
ARRAY_LENGTH(SPLIT(value)) `count`,
value positions
FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(indexed_abstract, '$.InvertedIndex'), r'"[^"]+":\[[\d,]*?]')) pair,
UNNEST([STRUCT(REPLACE(SPLIT(pair, ':')[OFFSET(0)], '"', '') AS key, SPLIT(pair, ':')[OFFSET(1)] AS value)])
) keyword
FROM `project.dataset.table`
If to apply to sample data from your question - result is
Row ObjectID keyword.key keyword.count keyword.positions
1 1984931229 Twenty-seven 1 [0]
metastatic 2 [1,45]
breast 1 [2]
adenocarcinoma 1 [3]
patients 1 [4]
Update on Op's comment - I was wondering if I wanted to make the positions an array (a repeated field), how would I do that?
Change needs to be done in just one line
SPLIT(REGEXP_REPLACE(value, r'\[|]', '')) positions
Upvotes: 4