Ed Moon Little
Ed Moon Little

Reputation: 87

Using regexp in Big Query to extract URLs

I've been trying to extract any URL present within my 'Text' column in Big Query. The column contains a mixture of text and URLs dotted throughout (a cell might contain more than one URL) I'm trying to use this regexp:

SELECT

  REGEXP_EXTRACT  (Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*')
FROM
Data.Text_Files

I currently get 'failed to parse regular expression' when I try to run the query. I've tried modifying it but to no avail.

The regexp works in an online builder but I'm just not sure how to incorporate it into Big Query.

Any help would be much appreciated - or at least pointers on how to incorporate regular expressions into Big Query!

Upvotes: 0

Views: 4614

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

Try below - it is for BigQuery Standard SQL (see Enabling Standard SQL and Migrating from legacy SQL)

WITH YourTable AS (
  SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask).  ' AS Text UNION ALL
  SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
  SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
)
SELECT 
 id, 
 REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable

This gives you output with id field, and repeated field with all respective URLs

If you need flattened result - you can use below variation

WITH YourTable AS (
  SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask).  ' AS Text UNION ALL
  SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
  SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
)
SELECT
  id, URL    
FROM (
  SELECT id, REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
  FROM YourTable
), UNNEST(URL) as URL

Note: you can use here any regexp that you will be able to find on web - but what a must is - there is only one matching group is allowed! so all inner matching group should be escaped with ?: as you can see it in above examples. So the ONLY group that you expect to see in output should be left as is - w/o ?:

Upvotes: 3

Addison
Addison

Reputation: 8347

Your regex has an incomplete capturing group, and has 2 unescaped characters. I don't know which online regex builder you're using, but maybe you forgot to put your new regex into it?

The problems are as follows:

(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*
POINTERS TO PROBLEMS ON THIS LINE --->                             ^1                    ^^2
  1. This is the start of a capturing group with no end. You probably want the ) right before the *.
  2. All slashes need to be escaped. This should probably be \/ or maybe even \/\\.

Here is an example with both of my suggestions implemented: https://regex101.com/r/pt1hqS/1

Good luck fixing it!

Upvotes: 2

Related Questions