Reputation: 165
I am trying to extract the product names from the URLs below using regex in Google BigQuery:
https://www.example.com/en/uk/product/clothing/trousers-leggings/cool-grey-joggers-9800977
Product name should be: cool-grey-joggers
https://www.example.com/en/uk/product/hot-denim-dress-7842666
Product name should be: hot-denim-dress
I have tried the following regular expression:
'/product/(.+)-[0-9]+$'
but this seems to be greedy matching and will return:
clothing/trousers-leggings/cool-grey-joggers
hot-denim-dress
I have tried numerous variations of the above but all are either not matching anything or capturing too much.
How can I modify so it captures the product name only and not the the parent sub folders as well?
Upvotes: 2
Views: 695
Reputation: 163217
If /product/
has to be part of the path:
/product/(?:[^/]+/)*([^/]+)-[0-9]+$
The pattern matches:
/product/
Match literally(?:[^/]+/)*
Optionally repeat matching parts that end with a /
([^/]+)
Capture group 1, match 1+ occurrence of any char except /
-[0-9]+$
Match -
and 1+ digits till the end of the stringUpvotes: 3
Reputation: 10152
Try this one:
SELECT REGEXP_EXTRACT('https://www.example.com/en/uk/product/clothing/trousers-leggings/cool-grey-joggers-9800977', r'([^/]+)-[0-9]+$')
UNION ALL
SELECT REGEXP_EXTRACT('https://www.example.com/en/uk/product/hot-denim-dress-7842666', r'([^/]+)-[0-9]+$')
Upvotes: 1