Ilja
Ilja

Reputation: 631

Regexp in BigQuery

how can I search in bigquery for expressions and group them even if they are messed up by semicolons?

Database example:

:Adidas
Adidas
Adidas;
null
adidas
7up
7UP
7UP;
:7UP
null

I'd like to group them and count. I'd like to get this result:

adidas 4
7up    4
null   2

By now the goup by does not help and I do the job in Excel, which is everything but fast.

Upvotes: 2

Views: 1973

Answers (1)

Pentium10
Pentium10

Reputation: 207838

First you need to normalize the text to retain only valid words. The below regular expression is just a simple one, you need to match and extend to your logic.

SELECT normalized,
       count(1) AS c
FROM
  (SELECT label,
          lower(REGEXP_EXTRACT(label,r'[[:punct:]]?([[:^punct:]]*)')) AS normalized
   FROM
     (SELECT string(':Adidas') AS label),
     (SELECT string('Adidas') AS label),
     (SELECT string('Adidas;') AS label),
     (SELECT string(NULL) AS label),
     (SELECT string('adidas') AS label),
     (SELECT string('7up') AS label),
     (SELECT string('7UP') AS label),
     (SELECT string('7UP;') AS label),
     (SELECT string(':7UP') AS label),
     (SELECT string(NULL) AS label),)
GROUP BY normalized
ORDER BY c DESC

this outputs:

+-----+------------+---+---+
| Row | normalized | c |   |
+-----+------------+---+---+
|   1 | adidas     | 4 |   |
|   2 | 7up        | 4 |   |
|   3 | null       | 2 |   |
+-----+------------+---+---+

Upvotes: 4

Related Questions