Jacques

Reputation: 7135

Regular Expression break down URL into parts

I've only recently started learning regex, so I'm not yet sure about a couple of aspects of the whole thing.

Right now my web page reads in the URL, breaks it up into parts, and only uses certain parts for processing:

E.g. 1) http://mycontoso.com/products/luggage/selloBag

E.g. 2) http://mycontoso.com/products/luggage/selloBag.sf404.aspx

For some reason Sitefinity is giving us both possibilities, which is fine, but what I need from this is only the actual product details as in "luggage/selloBag"

My current regex is: "(.*)(map-search)(\/)(.*)(\.sf404\.aspx)". I combine this with a replace statement and extract the contents of group 4 (or $4). That is fine when the .sf404.aspx suffix is present, but because the pattern requires the suffix, it doesn't work for URLs without it (example 1).

So the question is: is it possible to match two possibilities with a regular expression, where part of the string might or might not be there, and still reference the group whose value you actually want to use?

Upvotes: 0

Views: 1492

Answers (3)

Richard H

Reputation: 39055

You don't say whether you're doing this in JavaScript, but if you are, the parseUri library written by Steven Levithan does a pretty good job of parsing URLs. You can get it from various places, including here (click on the "Source Code" tab) and here.
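
The same "let a URL parser do the splitting" idea can be sketched in Python for comparison. This is not parseUri; it uses the standard library's `urllib.parse.urlsplit` as a stand-in. The helper name `product_path` and the `/products/` prefix are assumptions taken from the example URLs in the question.

```python
from urllib.parse import urlsplit

def product_path(url, prefix="/products/", suffix=".sf404.aspx"):
    """Return the URL path after `prefix`, with an optional `suffix` removed.

    `prefix` and `suffix` are assumptions based on the question's examples.
    """
    path = urlsplit(url).path
    if not path.startswith(prefix):
        return None
    path = path[len(prefix):]
    if path.endswith(suffix):
        path = path[:-len(suffix)]
    return path

print(product_path("http://mycontoso.com/products/luggage/selloBag"))
print(product_path("http://mycontoso.com/products/luggage/selloBag.sf404.aspx"))
```

Both calls should yield the same product path, since the suffix is stripped only when it is present.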

Upvotes: 0

ridgerunner

Reputation: 34395

RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:

    re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
    # Where:
    # scheme    = $2
    # authority = $4
    # path      = $5
    # query     = $7
    # fragment  = $9
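
As a rough illustration of how those numbered groups map onto a real URL, here is a minimal sketch that applies the Appendix B pattern to one of the URLs from the question. The function name `split_uri` is hypothetical.

```python
import re

# The RFC 3986 Appendix B pattern, compiled once.
RE_3986 = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Map the numbered groups to component names (hypothetical helper)."""
    m = RE_3986.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://mycontoso.com/products/luggage/selloBag.sf404.aspx")
print(parts["path"])  # the component the question cares about
```

Components that are absent from the URL (here the query and the fragment) come back as `None` rather than as empty strings.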

Here is an enhanced (and commented) regex (in Python syntax) which utilizes named capture groups:

    import re

    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
        (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)

For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation

Upvotes: 5

Fred Foo

Reputation: 363627

Depends on your regex implementation, but most support a syntax like

(\.sf404\.aspx|)

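This takes the place of the last group in your pattern, the literal suffix. Counting from 1 that is group 5, so the product details you want stay in group 4 (or $4). The | lists two alternatives, one of which is the empty string, which makes the suffix optional.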
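
One way this could look in Python, keeping the asker's group numbering, is sketched below. Two details here are assumptions rather than part of the answer above: the literal `products` is taken from the example URLs (the question's pattern says `map-search`), and group 4 is made lazy and the pattern anchored with `$`. Without those two changes a greedy `(.*)` would swallow the suffix, because the empty alternative always matches.

```python
import re

# Group 4 is lazy (.*?) and the pattern ends with $, so an existing
# ".sf404.aspx" lands in group 5 instead of being absorbed by group 4.
PATTERN = re.compile(r"(.*)(products)(/)(.*?)(\.sf404\.aspx|)$")

def product_details(url):
    """Replace the whole URL with the contents of group 4 (hypothetical helper)."""
    return PATTERN.sub(r"\4", url)

print(product_details("http://mycontoso.com/products/luggage/selloBag"))
print(product_details("http://mycontoso.com/products/luggage/selloBag.sf404.aspx"))
```

If the URL does not contain `products/` at all, `sub` leaves the string unchanged, so callers may want to test for a match first.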

Upvotes: 0
