Reputation: 179
I am looking to extract parts of a string that follow specified letters, for example from the below string:
wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4
I am looking to extract 275 (for wc), 2 (for c1.3) and 25 (for c12.1).
I have tried the following but the fields anew, ridanxietycnt and wordcount just show "NULL".
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "")AS FLOAT64)) tone,
avg(CAST(REGEXP_EXTRACT(GCAM, r'c1.3:([-d.]+)')AS FLOAT64)) anew,
sum(CAST(REGEXP_EXTRACT(GCAM, r'c12.1:([-d.]+)')AS FLOAT64)) ridanxietycnt,
sum(CAST(REGEXP_EXTRACT(GCAM, r'wc:(d+)')AS FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
group by daydate
I would expect to see the aggregated number for each column.
I wonder if the issue is with the regex expression?
Upvotes: 0
Views: 299
Reputation: 7277
You can treat them as simple key/value pairs and then do the aggregation on top of it. Something like:
select
substr(CAST((DATE) AS STRING),0,8) as daydate,
split(x,':')[safe_offset(0)] as key,
cast(split(x,':')[safe_offset(1)] as float64) as value
from `gdelt-bq.gdeltv2.gkg_partitioned`,
unnest(split(GCAM, ',')) as x
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
Hope this helps.
Upvotes: 1
Reputation: 75
This should get you 2 (anew), 25 (ridanxietycnt), 275 (wordcount), in this order
SELECT
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c1.3:(\d+)') as FLOAT64) anew,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c12.1:(\d+)') as FLOAT64) ridanxietycnt,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'wc:(\d+)') as FLOAT64) wordcount
Upvotes: 1