Bendegúz Tunyogi
Bendegúz Tunyogi

Reputation: 33

BigQuery - Using regexp with LIKE operator (?)

I'd like to get productids from url and I've almost finetuned a query to do it but still there is an issue I cannot solve. The url usually looks like this:

/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/

or

/harry-potter-es-a-tuz-serlege-2019-m19247107/

As you can see there are two types of ids:

I created this case when statement:

CASE
   WHEN MAX(hits.page.pagePath) LIKE '%-p%' 
     THEN MAX(REGEXP_REPLACE(REGEXP_EXTRACT(                
       hits.page.pagePath, '-p[0-9]+/'), '\\-|p|/', ''))                                
   WHEN MAX(hits.page.pagePath) LIKE '%-m%' 
     THEN MAX(REGEXP_REPLACE(REGEXP_EXTRACT(                
       hits.page.pagePath, '-m[0-9]+/'), '\\-|m|/', ''))
   ELSE NULL            
END AS productId

It's a little complicated at the first look but I really needed a regexp_replace and a regexp_extract because '-p' or '-m' characters doesn't appear only before the id but it can be multiplied times in a url.

The problem with my code is that there are some special cases when the url looks like this:

/elveszett-profeciak-2019-m17855487/

As you can see the id starts with '-m' but the url also contains '-p'. In this case the result is empty value in the query. I think it could be solved by modifying the like operator in the when part of the case when statement: LIKE '%-p%' or LIKE '%-m%' It would be great to have a regexp expression after or instead of the LIKE operator. Something similar to the parameter of '-p[0-9]+/' what I used in regexp_extract function.

So what I would need is to define in the when part of the statement that if the '-p' or '-m' text is followed by numbers in the urls

I'm not sure it's possible to do or not in BQ.

Upvotes: 0

Views: 1285

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269793

So what I would need is to define in the when part of the statement that if the '-p' or '-m' text is followed by numbers in the urls

I think you want '-p' and '-m' followed by digits. If so, I think this does what you want:

select regexp_extract(url, '-[pm][0-9]+')
from (select '/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/' as url union all 
      select '/elveszett-profeciak-2019-m17855487/' union all
      select '/harry-potter-es-a-tuz-serlege-2019-m19247107/'
     ) x

Upvotes: 1

Related Questions