Reputation: 87
I've been trying to extract any URL present within my 'Text' column in Big Query. The column contains a mixture of text and URLs dotted throughout (a cell might contain more than one URL) I'm trying to use this regexp:
SELECT
REGEXP_EXTRACT (Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*')
FROM
Data.Text_Files
I currently get 'failed to parse regular expression' when I try to run the query. I've tried modifying it but to no avail.
The regexp works in an online builder but I'm just not sure how to incorporate it into Big Query.
Any help would be much appreciated - or at least pointers on how to incorporate regular expressions into Big Query!
Upvotes: 0
Views: 4614
Reputation: 172974
Try below - it is for BigQuery Standard SQL (see Enabling Standard SQL and Migrating from legacy SQL)
WITH YourTable AS (
SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL
SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text
)
SELECT
id,
REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable
This gives you output with id field, and repeated field with all respective URLs
If you need flattened result - you can use below variation
WITH YourTable AS (
SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL
SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text
)
SELECT
id, URL
FROM (
SELECT id, REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable
), UNNEST(URL) as URL
Note: you can use here any regexp that you will be able to find on web - but what a must is - there is only one matching group is allowed! so all inner matching group should be escaped with ?:
as you can see it in above examples. So the ONLY group that you expect to see in output should be left as is - w/o ?:
Upvotes: 3
Reputation: 8347
Your regex has an incomplete capturing group, and has 2 unescaped characters. I don't know which online regex builder you're using, but maybe you forgot to put your new regex into it?
The problems are as follows:
(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*
POINTERS TO PROBLEMS ON THIS LINE ---> ^1 ^^2
)
right before the *
. \/
or maybe even \/\\
.Here is an example with both of my suggestions implemented: https://regex101.com/r/pt1hqS/1
Good luck fixing it!
Upvotes: 2