data princess
data princess

Reputation: 1158

BigQuery Regex_Replace first instance of duplicated substring

I know there are a lot of questions about lazy regex matching, but none of the solutions I've seen have worked. Here's the issue:

Some addresses in my BigQuery results look like this:

www.example.comwww.example.com/path/to/page
apply.example.comapply.example.com/eapp/

I want to strip off the duplicated part to get

www.example.com/path/to/page
apply.example.com/eapp/

I've tried using REGEX_REPLACE() like this:

REGEXP_REPLACE(raw.query.tidy_Landing, r'(.*?)\.com','') AS Landing_Page

But that still finds both matches and so returns

/path/to/page
/eapp/

What's wrong with my regex?

Upvotes: 0

Views: 1567

Answers (2)

Elliott Brossard
Elliott Brossard

Reputation: 33705

I wanted to see if it's possible to do this without a regex, although it ends up being a bit verbose :) This answer assumes that the address is always duplicated and ends with .com. Assuming that's the case, it should be possible to use SPLIT to extract the parts that you are interested in:

SELECT
  CONCAT(
    SPLIT(text, '.com')[OFFSET(0)],
    '.com',
    SPLIT(text, '.com')[OFFSET(2)]
  ) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
);

If you want the query to be tolerant of non-duplicated addresses, you can make a slight modification:

SELECT
  (
    SELECT 
      CONCAT(
        parts[OFFSET(0)],
        '.com',
        parts[OFFSET(ARRAY_LENGTH(parts) - 1)]
      )
    FROM (SELECT SPLIT(text, '.com') AS parts)
  ) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/' UNION ALL
  SELECT 'www.example.com/path/to/page'
);

Going one step further, you can extract the logic into a UDF:

CREATE TEMP FUNCTION GetLandingPage(text STRING) AS (
  (
    SELECT 
      CONCAT(
        parts[OFFSET(0)],
        '.com',
        parts[OFFSET(ARRAY_LENGTH(parts) - 1)]
       )
    FROM (SELECT SPLIT(text, '.com') AS parts)
  )
);

SELECT
  GetLandingPage(text) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/' UNION ALL
  SELECT 'www.example.com/path/to/page'
);

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

#standardSQL
WITH t AS (
  SELECT 'www.example.comwww.example.com/path/to/page' str UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
)
SELECT str, REGEXP_REPLACE(str, r'^(.*?\.com)', '') fix
FROM t

output is

str                                             fix  
www.example.comwww.example.com/path/to/page     www.example.com/path/to/page  
apply.example.comapply.example.com/eapp/        apply.example.com/eapp/   

Upvotes: 2

Related Questions