Sean W.

Reputation: 5132

Capture domain and path from URL with regex

I'm trying to write a regex that will capture the domain and path from a URL. I've tried:

https?:\/\/(.+)(\/.*)

That works fine for http://example.com/foo:

Match 1
0.  example.com
1.  /foo

But it does not give what I would expect for http://example.com/foo/bar:

Expected:

Match 1
0.  example.com
1.  /foo/bar

Actual:

Match 1
0.  example.com/foo
1.  /bar

What am I doing wrong?

Upvotes: 3

Views: 4945

Answers (3)

Palec

Reputation: 13574

https?:\/\/(.+)(\/.*)

What am I doing wrong?

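+ is greedy: .+ first consumes the rest of the string, slashes included, and then backtracks only as far as the last slash. You should apply it to [^/] instead of a dot, so the first group cannot run past the first slash.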

Also notice that your “path” part will also contain the query string and fragment (hash).

This one gets just the domain (+ login, password, port) and path (without query string or fragment).

^https?://([^/]+)(/[^?#]*)?

I leave escaping the slashes accordingly up to you.
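
A minimal sketch of the difference in Python, assuming the standard `re` module (the question does not name a language, and the helper name `split_url` is made up for illustration). Python needs no regex delimiters, so the slashes are left unescaped:

```python
import re

# Pattern from the question: greedy .+ backtracks only to the LAST slash.
GREEDY = re.compile(r"https?://(.+)(/.*)")
# Pattern from this answer: [^/]+ cannot cross a slash, so it stops at the first one.
FIXED = re.compile(r"^https?://([^/]+)(/[^?#]*)?")

def split_url(pattern, url):
    """Return the two captured groups, or None if the pattern does not match."""
    m = pattern.match(url)
    return m.groups() if m else None

print(split_url(GREEDY, "http://example.com/foo/bar"))     # ('example.com/foo', '/bar')
print(split_url(FIXED, "http://example.com/foo/bar"))      # ('example.com', '/foo/bar')
print(split_url(FIXED, "http://example.com/foo?x=1#top"))  # ('example.com', '/foo')
print(split_url(FIXED, "http://example.com"))              # ('example.com', None)
```

Because the path group is optional, a URL without a path still matches and the second group comes back as `None`.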

Caveat: This expects a valid URI; for valid input it correctly extracts the authority and path sections. If you want to parse a URI according to the standard, you need to implement the whole grammar or use the official regex from Appendix B of RFC 2396.

The following line is the regular expression for breaking-down a URI reference into its components.

   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

   http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

   $1 = http:
   $2 = http
   $3 = //www.ics.uci.edu
   $4 = www.ics.uci.edu
   $5 = /pub/ietf/uri/
   $6 = <undefined>
   $7 = <undefined>
   $8 = #Related
   $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the four components and fragment as

   scheme    = $2
   authority = $4
   path      = $5
   query     = $7
   fragment  = $9
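
As a rough illustration, the RFC expression can be used from Python as follows. This is only a sketch: the function name `parse_uri` and the dictionary layout are assumptions for the example, not part of the RFC.

```python
import re

# The regular expression quoted above from the RFC, unchanged.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def parse_uri(uri):
    """Split a URI reference into the five components named in the RFC."""
    # Every part of the expression is optional, so it matches any string.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),     # None when the component is not present
        "fragment": m.group(9),
    }

parts = parse_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts["authority"])  # www.ics.uci.edu
print(parts["path"])       # /pub/ietf/uri/
print(parts["query"])      # None
```

Python reports a group that did not participate in the match as `None`, which corresponds to `<undefined>` in the listing above. The standard library's `urllib.parse.urlsplit` performs a similar split without a hand-written regex.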

Upvotes: 6

user557597

Reputation:

Something like this 'greedy' version might work. I don't know if Python requires delimiters, so this is just the raw regex.

 #   https?://([^/]+)(.*)

 https?://
 ( [^/]+ )           # (1)
 ( .* )              # (2)
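
For what it's worth, Python needs no delimiters, and the expanded layout above can be kept as is by compiling with `re.VERBOSE`. A small sketch, where the name `URL_RE` and the sample URL are only illustrative:

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats '#' as a comment start.
URL_RE = re.compile(
    r"""
    https?://
    ( [^/]+ )    # (1) authority: everything up to the first slash
    ( .* )       # (2) the rest: path plus any query string and fragment
    """,
    re.VERBOSE,
)

m = URL_RE.match("http://example.com/foo/bar?x=1")
print(m.group(1))  # example.com
print(m.group(2))  # /foo/bar?x=1
```

Note that group 2 keeps the query string here, and is an empty string when the URL has no path at all.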

Upvotes: 0

GabiMe

Reputation: 18503

As noted, this is a non-greedy version: https?:\/\/(.+?)(\/.*)
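
A quick check of the non-greedy pattern in Python, as a sketch only (the name `LAZY` is made up, and the slashes are left unescaped because Python does not need them escaped):

```python
import re

# .+? expands one character at a time, so it stops at the FIRST slash it can.
LAZY = re.compile(r"https?://(.+?)(/.*)")

print(LAZY.match("http://example.com/foo/bar").groups())  # ('example.com', '/foo/bar')
print(LAZY.match("http://example.com"))                   # None
```

The second call shows one limitation: the pattern still requires a slash after the host, so a URL with no path does not match at all.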

Upvotes: 6
