Reputation: 631
how can I search in bigquery for expressions and group them even if they are messed up by semicolons?
Database example:
:Adidas
Adidas
Adidas;
null
adidas
7up
7UP
7UP;
:7UP
null
I'd like to group them and count. I'd like to get this result:
adidas 4
7up 4
null 2
By now the goup by does not help and I do the job in Excel, which is everything but fast.
Upvotes: 2
Views: 1973
Reputation: 207838
First you need to normalize the text to retain only valid words. The below regular expression is just a simple one, you need to match and extend to your logic.
SELECT normalized,
count(1) AS c
FROM
(SELECT label,
lower(REGEXP_EXTRACT(label,r'[[:punct:]]?([[:^punct:]]*)')) AS normalized
FROM
(SELECT string(':Adidas') AS label),
(SELECT string('Adidas') AS label),
(SELECT string('Adidas;') AS label),
(SELECT string(NULL) AS label),
(SELECT string('adidas') AS label),
(SELECT string('7up') AS label),
(SELECT string('7UP') AS label),
(SELECT string('7UP;') AS label),
(SELECT string(':7UP') AS label),
(SELECT string(NULL) AS label),)
GROUP BY normalized
ORDER BY c DESC
this outputs:
+-----+------------+---+---+
| Row | normalized | c | |
+-----+------------+---+---+
| 1 | adidas | 4 | |
| 2 | 7up | 4 | |
| 3 | null | 2 | |
+-----+------------+---+---+
Upvotes: 4