Reputation:
Using the GDELT public database in Google query, I am trying to find the top themes associated with Israeli Prime Minister Benjamin Netanyahu around March 3, 2015.
I used the following SQL query
SELECT theme, COUNT(*) as count
FROM (
select REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*', '') theme
from `gdelt-bq.gdeltv2.gkg`
where DATE>20150302000000 and DATE < 20150304000000 and V2Persons like '%Netanyahu%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300
The split is necessary because the V2Themes column uses nested listings. I then want to remove the character offset. This should give me the following:
Row theme count
1 GENERAL_GOVERNMENT 33677
2 LEADER 33405
3 TAX_FNCACT_MINISTER 31174
4 .... ...
But I get an error instead:
No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING.
I understand that SPLIT() creates an array of strings, but I do not know how to resolve this issue.
Is there another function I should use, or can this problem be resolved in another way?
*EDIT
The query works when it is run with Legacy SQL (also need to change the quotation marks to squared brackets). How can I achieve the same using standard SQL?
Upvotes: 2
Views: 4176
Reputation: 172974
Below is for BigQuery Standard SQL
#standardSQL
SELECT theme, COUNT(*) AS COUNT
FROM (
SELECT REGEXP_REPLACE(value, r',.*', '') theme
FROM `gdelt-bq.gdeltv2.gkg`, UNNEST(SPLIT(V2Themes,';')) value
WHERE DATE>20150302000000 AND DATE < 20150304000000
AND V2Persons LIKE '%Netanyahu%'
)
GROUP BY theme
ORDER BY 2 DESC
LIMIT 300
Be aware of following: when you run query and it fails - the cost for you is zero (in most common cases).
This is not a case when your query is correct (as above one in this answer) - in such cases query will successfully complete and you will be billed as per
in Classic UI:
or in BigQuery Console:
Upvotes: 1