user16045165

Reputation: 23

Regular expression to extract different parts of URL and path

Consider URLs like

https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123

I need a regular expression to match these URLs and convert them into

stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas

I need to build a single regex rule from which I can extract the path segments using $1, $2, etc., without any JavaScript logic, because the rule goes into a classification rule builder tool. I tried the rule "URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?" and extracted $4:$5, which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981

But this is incorrect, because the path still contains slashes instead of colons. Can anyone help me with the correct regex for this?

Upvotes: 1

Views: 2927

Answers (1)

Hao Wu

Reputation: 20734

You may try this:

Regex

/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g

Substitution

$1:$2

(?:                     non-capturing group
    https?:\/\/         "http://" or "https://"
    ([^\/?\s#]+)        capture the domain and put it in group 1
)?                      make this group optional (only the first match contains the domain)
\/                      "/"
([^\/?\s#]*)            one segment of the url path, capture it in group 2
(?:[\?#].*)?            an optional non-capturing group for consuming query string or # anchor at the end
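
To see how the global substitution behaves, here is a minimal sketch in Python rather than in the rule builder tool. The function name to_key is made up for illustration, and Python's re module writes the substitution as \1:\2 instead of $1:$2. It relies on Python 3.5+ replacing an unmatched group with an empty string, which is why every match after the first contributes only ":segment".

```python
import re

# Same pattern as above; re.sub replaces every match, like the g flag does.
PATTERN = re.compile(r"(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?")


def to_key(url):
    """Turn a URL into domain:seg1:seg2:... (hypothetical helper name)."""
    # First match: "scheme://domain/seg1" becomes "domain:seg1".
    # Later matches: "/segN" becomes ":segN", because group 1 is empty there.
    return PATTERN.sub(r"\1:\2", url)


if __name__ == "__main__":
    print(to_key("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
    print(to_key("http://stackoverflow.com/v2/summary/saas?test=123"))
```

The query string is swallowed by the last optional group of the final match, so it never reaches the output.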

Update

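If you can't use the g flag for substitution, there is no better option than to brute-force every path depth, with one rule per number of path segments. These rules also drop an optional leading www. from the domain.

For each additional segment of the URL path, append \/([^\/?#\s]+) to the pattern and the next group reference (:$2, :$3, and so on) to the substitution: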


  • https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1

  • https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2

  • https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3

  • https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4

  • https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5

  • https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6
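
Since the rules above differ only in how many times the segment group is repeated, they can be generated instead of typed by hand. The following is a rough Python sketch; build_rule and classify are hypothetical helper names, and classify only imitates what the rule builder would do when it tries each rule in turn.

```python
import re

# Building blocks copied from the rules above.
HOST = r"^https?:\/\/(?:www\.)?([^\/?#\s]+)"
SEGMENT = r"\/([^\/?#\s]+)"
TAIL = r"\/?(?:[#?].*)?$"


def build_rule(n_segments):
    """Return (pattern, substitution) for exactly n_segments path segments."""
    pattern = HOST + SEGMENT * n_segments + TAIL
    substitution = ":".join(f"${i}" for i in range(1, n_segments + 2))
    return pattern, substitution


def classify(url, max_segments=6):
    """Try the rules from 0 to max_segments segments and join the captures."""
    for n in range(max_segments + 1):
        pattern, _ = build_rule(n)
        match = re.match(pattern, url)
        if match:
            return ":".join(match.groups())
    return None  # more segments than rules were generated


if __name__ == "__main__":
    # Print the first few rules in the same layout as the list above.
    for n in range(3):
        print(*build_rule(n), sep="\n")
    print(classify("http://stackoverflow.com/v2/summary/saas?test=123"))
```

Note that the first example URL in the question has six path segments, so it needs one rule more than the list above goes up to; that is why max_segments defaults to 6 in this sketch.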

Upvotes: 1
