cget
cget

Reputation: 410

SQL RegExp to pull the slug from a URL

I want to pull the slug from a URL using SQL regular expressions

https://example.com/december-2019/content/this-is-the-slug/

I eventually came across this

SELECT
regexp_extract(url, r'\/([a-z0-9_-]*[\/]?)$') slug
FROM table

This gives me this-is-the-slug/

However, I just want this-is-the-slug

I know I can just nest my query and remove the end '/', but is there anything I can add to the RegExp '\/([a-z0-9_-]*[\/]?)$' above that will do it all in one?

Also, how does the RegExp work? I'm a bit confused about how it operates. Is it something like this

$ # start at end of string
/( ) # work back and find the first '/' and create a capturing group ()
[ ]* # create a character set [] in the capturing group and match all elements in it using *
a-z0-9_- # match all these characters in the character set and pull them out.
[\/]? # something to do with greedy matching? Not sure what this does.

Any help on this would be great. Thanks

Upvotes: 0

Views: 940

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

I would still recommend using REGEXP in cases like yours - your regular expression was very close to what it should of be - you just need to move /? outside of captured group like in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'https://example.com/december-2019/content/this-is-the-slug1/' url UNION ALL
  SELECT 'https://example.com/december-2019/content/this-is-the-slug2' 
)
SELECT 
  REGEXP_EXTRACT(url, r'\/([a-z0-9_-]*?)/?$') slug
FROM `project.dataset.table`  

with result

Row slug     
1   this-is-the-slug1    
2   this-is-the-slug2    

Upvotes: 0

Tim Biegeleisen
Tim Biegeleisen

Reputation: 520968

Here is a non regex option using the SPLIT() function:

SELECT ARRAY_REVERSE(SPLIT(RTRIM(url, '/'), '/'))[SAFE_OFFSET(0)]
FROM yourTable;

This approach removes the trailing path separator /, then splits the URL on / to generate an array. Note that we use RTRIM for this purpose. This means that the URL input won't be affected if it doesn't have a trailing path separator. The last entry in that array is retained, which should be the slug.

Upvotes: 1

Related Questions