maq

Reputation: 1226

Regex check if given string is relative URL

I have already read this question about how to check whether a string is an absolute or relative URL. My problem is slightly different: I need a regex that checks whether a given string is a relative URL, i.e. one that does not start with any protocol or with a double slash (//).

Actually, I am doing web scraping with Beautiful Soup and I want to retrieve all relative links. Beautiful Soup uses this syntax:

soup.findAll(href=re.compile(REGEX_TO_MATCH_RELATIVE_URL))

So, that's why I need this.

The test cases are:

about.html
tutorial1/
tutorial1/2.html
/
/experts/
../
../experts/
../../../
./
./about.html

Thank you so much.

Upvotes: 2

Views: 15782

Answers (3)

gijswijs

Reputation: 2128

I prefer this one because it captures more edge cases:

(?:url\(|<(?:link|script|img)[^>]+(?:src|href)\s*=\s*)(?!['"]?(?:data|http))['"]?([^'"\)\s>]+)

Source: https://www.regextester.com/94254
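
As a rough illustration of what this pattern does, here is a minimal Python sketch that runs it over a small HTML snippet. The snippet and the variable names are made up for the example and are not part of the linked source. Note that the pattern targets url(...) values and the src/href attributes of link, script and img tags, so a plain <a href> link is not picked up as written.

```python
import re

# The pattern from above, unchanged. Group 1 captures the URL itself.
ASSET_URL = re.compile(
    r"""(?:url\(|<(?:link|script|img)[^>]+(?:src|href)\s*=\s*)(?!['"]?(?:data|http))['"]?([^'"\)\s>]+)"""
)

# Hypothetical markup, invented for this example.
html = """
<link rel="stylesheet" href="/css/site.css">
<script src="https://cdn.example.com/lib.js"></script>
<img src="images/logo.png">
<div style="background: url(img/bg.png)"></div>
<a href="about.html">About</a>
"""

# findall returns the contents of the single capturing group for each match.
found = ASSET_URL.findall(html)
print(found)  # ['/css/site.css', 'images/logo.png', 'img/bg.png']
```

The https:// script source is rejected by the negative look-ahead, and the <a> tag is skipped because it is not in the list of tag names.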

Upvotes: 0

Wiktor Stribiżew

Reputation: 626903

Since you found it helpful, I am posting my suggestion as an answer.

The regular expression can be:

^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*

See demo

Note that the pattern becomes less and less readable as you add exclusions or alternatives. It may therefore be worth using VERBOSE mode (declared with re.X):

import re
p = re.compile(r"""^                    # At the start of the string, ...
                   (?!                  # check if next characters are not...
                      www\.             # URLs starting with www.
                     |
                      (?:http|ftp)s?:// # URLs starting with http, https, ftp, ftps
                     |
                      [A-Za-z]:\\       # Local full paths starting with a drive letter, colon and backslash
                     |
                      //                # Protocol-relative URLs (or UNC-style paths) starting with //
                   )                    # End of look-ahead check
                   .*                   # Match up to the end of string""", re.X)
print(p.search("./about.html"))          # => There is a match
print(p.search("//dub-server1/mynode"))  # => No match

See IDEONE demo
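
To see how the pattern behaves on the test cases from the question, here is a small self-contained check. It is only a sketch: the helper name is_relative and the absolute examples are assumptions added for illustration.

```python
import re

# Same pattern as the one-line version above.
RELATIVE_URL = re.compile(r"^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*")

# Test cases from the question: all of these should be treated as relative.
relative = [
    "about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
    "../", "../experts/", "../../../", "./", "./about.html",
]

# Made-up absolute examples that the look-ahead is meant to reject.
absolute = [
    "http://example.com/", "https://example.com/a.html", "ftp://example.com/f",
    "//cdn.example.com/lib.js", "www.example.com", "C:\\temp\\about.html",
]


def is_relative(url):
    """Return True if the pattern matches, i.e. the URL looks relative."""
    return RELATIVE_URL.search(url) is not None


for url in relative + absolute:
    print(f"{url!r:32} relative={is_relative(url)}")
```

The compiled pattern can be passed as the href argument in the findAll call from the question. Schemes that are not listed in the look-ahead (mailto:, for example) still pass, so the list of exclusions may need extending for real pages.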

As for the regexes in Washington Guedes's answer:

  1. ^([a-z0-9]*:|.{0})\/\/.*$ - matches:

    • ^ - beginning of the string
    • ([a-z0-9]*:|.{0}) - 2 alternatives:
      • [a-z0-9]*: - 0 or more letters or digits followed by :
      • .{0} - an empty string
    • \/\/.* - // and 0 or more characters other than a newline (note that you do not need to escape / in Python)
    • $ - end of string

So, you can rewrite it as ^(?:[a-z0-9]*:)?//.*$. The i flag should be used with this regex.
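
A quick way to convince yourself that the rewrite behaves the same is to run both patterns over a few inputs. This is only a spot check on made-up sample strings, not a proof, and it uses just the i flag (re.I), since the g and m flags play no role for single-line strings searched one at a time.

```python
import re

original = re.compile(r"^([a-z0-9]*:|.{0})\/\/.*$", re.I)
rewritten = re.compile(r"^(?:[a-z0-9]*:)?//.*$", re.I)

# Sample inputs chosen for illustration.
samples = [
    "http://example.com/", "HTTPS://EXAMPLE.COM/", "//cdn.example.com/lib.js",
    "://odd", "about.html", "/experts/", "../experts/",
    "mailto:someone@example.com",
]

for s in samples:
    a = original.search(s) is not None
    b = rewritten.search(s) is not None
    assert a == b, s  # both patterns must agree on every sample
    print(f"{s!r:32} absolute={b}")
```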

  2. ^[^\/]+\/[^\/].*$|^\/[^\/].*$ - is not optimal and has 2 alternatives:

Alternative 1:

  • ^ - start of string
  • [^\/]+ - 1 or more characters other than /
  • \/ - Literal /
  • [^\/].*$ - a character other than / followed by any 0 or more characters other than a newline, up to the end of the string

Alternative 2:

  • ^ - start of string
  • \/ - Literal /
  • [^\/].*$ - a character other than / followed by any 0 or more characters other than a newline, up to the end of the string

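The two alternatives differ only in whether characters other than / precede the first slash, so the whole regex can be shortened to ^[^/]*/[^/].*$. The i option can safely be removed from the regex flags because the pattern contains no letters.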
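
The shortened form can be spot-checked against the original in the same way. The sketch below also lists which of the question's test cases the pattern accepts. The extra_cases list is made up for the example.

```python
import re

original = re.compile(r"^[^\/]+\/[^\/].*$|^\/[^\/].*$")
shortened = re.compile(r"^[^/]*/[^/].*$")

# Test cases from the question.
question_cases = [
    "about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
    "../", "../experts/", "../../../", "./", "./about.html",
]
# A few additional inputs, invented for this check.
extra_cases = ["http://example.com/", "//cdn.example.com/lib.js", "a/b//c"]

for s in question_cases + extra_cases:
    # Both patterns must agree on every input.
    assert bool(original.search(s)) == bool(shortened.search(s)), s

matched = [s for s in question_cases if shortened.search(s)]
missed = [s for s in question_cases if not shortened.search(s)]
print("matched:", matched)
print("missed: ", missed)
```

Because the pattern requires a slash followed by a character other than a slash, inputs such as about.html, tutorial1/ and / end up in the missed list.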

Upvotes: 11

user4227915

Reputation:

To match absolute URLs:

/^([a-z0-9]*:|.{0})\/\/.*$/gmi

Live testing here.


And to match relative URLs:

/^[^\/]+\/[^\/].*$|^\/[^\/].*$/gmi

Live testing here.

Upvotes: 2
