2bon2b
2bon2b

Reputation: 165

Using regex to extract product name from URL with Google BigQuery

I am trying to extract the product names from the URLs below using regex in Google BigQuery:

https://www.example.com/en/uk/product/clothing/trousers-leggings/cool-grey-joggers-9800977

Product name should be: cool-grey-joggers

https://www.example.com/en/uk/product/hot-denim-dress-7842666

Product name should be: hot-denim-dress

I have tried the following regular expression:

'/product/(.+)-[0-9]+$'

but this seems to be greedy matching and will return:

clothing/trousers-leggings/cool-grey-joggers

hot-denim-dress

I have tried numerous variations of the above but all are either not matching anything or capturing too much.

How can I modify so it captures the product name only and not the the parent sub folders as well?

Upvotes: 2

Views: 695

Answers (2)

The fourth bird
The fourth bird

Reputation: 163217

If /product/ has to be part of the path:

/product/(?:[^/]+/)*([^/]+)-[0-9]+$

The pattern matches:

  • /product/ Match literally
  • (?:[^/]+/)* Optionally repeat matching parts that end with a /
  • ([^/]+) Capture group 1, match 1+ occurrence of any char except /
  • -[0-9]+$ Match - and 1+ digits till the end of the string

Regex demo

Upvotes: 3

Sergey Geron
Sergey Geron

Reputation: 10152

Try this one:

SELECT REGEXP_EXTRACT('https://www.example.com/en/uk/product/clothing/trousers-leggings/cool-grey-joggers-9800977', r'([^/]+)-[0-9]+$')
UNION ALL
SELECT REGEXP_EXTRACT('https://www.example.com/en/uk/product/hot-denim-dress-7842666', r'([^/]+)-[0-9]+$')

enter image description here

Upvotes: 1

Related Questions