Reputation: 29
It's my first post in here, so I will try to keep it simple and easy to read.
In Postgres
I can look for the number of times a string contains another string, in my example, I have a list with sentiments and I'm trying to look for how many times they are mentioned in a forum post title which sits in another table. In Postgres
I usually do this with LEFT Join ON
string from table1 LIKE string from table2
. When I tried to do it in Bigquery
it doesn't work and I tried with LIKE
, CONTAINS
and REGEXP_MATCH
. These would be the type of query I would like to do.
SELECT sentiment_type, count(*) FROM
((SELECT forum, page_link, post_title, user, posted_date, content
FROM [sandbox:Forum_data.forum_table]) t1
JOIN [sandbox:Taxonomies.sentiment_taxonomy] t2
ON lower(post_title) LIKE '%'||lower(sentiment)||'%') a
GROUP BY sentiment_type, sentiment
ORDER BY 3 desc
Do you know if this is possible at all in Bigquery
?
Thank you.
Upvotes: 1
Views: 8260
Reputation: 29
Thank you very much guys, indeed I had to use Standard SQL but it doesn't work with large datasets so one of my colleagues helped me find a solution by splitting the string into words. The end result is not perfect but functional and it looks like this:
SELECT sentiment_type, SUM(vol) vol FROM
(
SELECT sentiment_type, a.vol vol, a.id id
FROM (
SELECT *
FROM (SELECT SPLIT(content,' ') AS content_ind
,Vol
,ID
FROM (SELECT content
,Vol
,ROW_NUMBER() OVER (ORDER BY content) ID
FROM (SELECT LOWER(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(content,'%',' '),'!',' '),'"',' '),'?',' '),'.',' '),',',' '),'/',' '),"'",' '),'-',' '),';',' '),':',' '),'(',' '),')',' '),'*',' '),'_',' '),'&',' '),'=',' '),'@',' ')) content
,COUNT(content) vol
FROM [sandbox:Forum_data.Expat_forum_data]
GROUP BY content)
ORDER BY 3)
)
GROUP BY 1,2,3) a JOIN
(SELECT LTRIM(RTRIM(sentiment)) sentiment, sentiment_type FROM [sandbox:Taxonomies.sentiment_taxonomy] WHERE LTRIM(RTRIM(sentiment)) NOT LIKE '% %') b
ON a.content_ind = b.sentiment
GROUP BY 1,2,3)
GROUP BY 1
ORDER BY 2 DESC
Upvotes: 1
Reputation: 173210
Below should work in Legacy SQL
SELECT sentiment_type, sentiment, COUNT(*) as volume
FROM [sandbox:Forum_data.forum_table] t1
CROSS JOIN [sandbox:Taxonomies.sentiment_taxonomy] t2
WHERE LOWER(post_title) LIKE '%' + LOWER(sentiment) + '%'
GROUP BY sentiment_type, sentiment
ORDER BY 3 desc
Upvotes: 0
Reputation: 33765
The query in standard SQL would be something like this:
SELECT sentiment_type, count(*)
FROM (
SELECT forum, page_link, post_title, user, posted_date, content
FROM `sandbox.Forum_data.forum_table`) t1
JOIN `sandbox.Taxonomies.sentiment_taxonomy` t2
ON lower(post_title) LIKE CONCAT('%', lower(sentiment), '%')
GROUP BY sentiment_type, sentiment
ORDER BY 3 desc;
Practically speaking a JOIN
or LEFT JOIN
isn't very scalable without an equality condition. Depending on the size of your tables, standard SQL may force you to include an equality as well, e.g. ON t1.key = t2.key AND lower(post_title) LIKE CONCAT('%', lower(sentiment), '%')
(assuming that such a key exists).
Upvotes: 1