Reputation: 29

join like bigquery

It's my first post in here, so I will try to keep it simple and easy to read.

In Postgres I can look for the number of times a string contains another string, in my example, I have a list with sentiments and I'm trying to look for how many times they are mentioned in a forum post title which sits in another table. In Postgres I usually do this with LEFT Join ON string from table1 LIKE string from table2. When I tried to do it in Bigquery it doesn't work and I tried with LIKE, CONTAINS and REGEXP_MATCH. These would be the type of query I would like to do.

SELECT sentiment_type, count(*) FROM
((SELECT forum, page_link, post_title, user, posted_date, content
  FROM [sandbox:Forum_data.forum_table]) t1
  JOIN [sandbox:Taxonomies.sentiment_taxonomy] t2
  ON lower(post_title) LIKE '%'||lower(sentiment)||'%') a
GROUP BY sentiment_type, sentiment
ORDER BY 3 desc

Do you know if this is possible at all in Bigquery?

Thank you.

Upvotes: 1

Answers (3)

Alex Andronic

Reputation: 29

Thank you very much guys, indeed I had to use Standard SQL but it doesn't work with large datasets so one of my colleagues helped me find a solution by splitting the string into words. The end result is not perfect but functional and it looks like this:

SELECT sentiment_type, SUM(vol) vol FROM
(
SELECT sentiment_type, a.vol vol, a.id id
    FROM (
      SELECT *
        FROM (SELECT SPLIT(content,' ') AS content_ind
             ,Vol
            ,ID
        FROM (SELECT content
                    ,Vol
                    ,ROW_NUMBER() OVER (ORDER BY content) ID
                FROM (SELECT LOWER(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(content,'%',' '),'!',' '),'"',' '),'?',' '),'.',' '),',',' '),'/',' '),"'",' '),'-',' '),';',' '),':',' '),'(',' '),')',' '),'*',' '),'_',' '),'&',' '),'=',' '),'@',' ')) content
                            ,COUNT(content) vol
                        FROM [sandbox:Forum_data.Expat_forum_data]
                    GROUP BY content)
            ORDER BY 3)
      )     
GROUP BY 1,2,3) a JOIN 
(SELECT LTRIM(RTRIM(sentiment)) sentiment, sentiment_type FROM     [sandbox:Taxonomies.sentiment_taxonomy] WHERE LTRIM(RTRIM(sentiment)) NOT LIKE '% %') b
ON a.content_ind = b.sentiment
GROUP BY 1,2,3)
GROUP BY 1
ORDER BY 2 DESC

Upvotes: 1

Mikhail Berlyant

Reputation: 173210

Below should work in Legacy SQL

SELECT sentiment_type, sentiment, COUNT(*) as volume
FROM [sandbox:Forum_data.forum_table] t1
CROSS JOIN [sandbox:Taxonomies.sentiment_taxonomy] t2
WHERE LOWER(post_title) LIKE '%' + LOWER(sentiment) + '%'
GROUP BY sentiment_type, sentiment
ORDER BY 3 desc

Upvotes: 0

Elliott Brossard

Reputation: 33765

The query in standard SQL would be something like this:

SELECT sentiment_type, count(*)
FROM (
  SELECT forum, page_link, post_title, user, posted_date, content
  FROM `sandbox.Forum_data.forum_table`) t1
JOIN `sandbox.Taxonomies.sentiment_taxonomy` t2
ON lower(post_title) LIKE CONCAT('%', lower(sentiment), '%')
GROUP BY sentiment_type, sentiment
ORDER BY 3 desc;

Practically speaking a JOIN or LEFT JOIN isn't very scalable without an equality condition. Depending on the size of your tables, standard SQL may force you to include an equality as well, e.g. ON t1.key = t2.key AND lower(post_title) LIKE CONCAT('%', lower(sentiment), '%') (assuming that such a key exists).

Upvotes: 1

join like bigquery

Answers (3)

Related Questions