Reputation: 1226
First, I have read this question about how to check if a string is an absolute or relative URL. My problem is that I need a regex to check whether a given string is a relative URL, i.e. that it does not start with any protocol or a double slash (//).
Actually, I am doing web scraping with Beautiful Soup and I want to retrieve all relative links. Beautiful Soup uses this syntax:
soup.findAll(href=re.compile(REGEX_TO_MATCH_RELATIVE_URL))
So, that's why I need this.
Test cases are:
about.html
tutorial1/
tutorial1/2.html
/
/experts/
../
../experts/
../../../
./
./about.html
Thank you so much.
Upvotes: 2
Views: 15782
Reputation: 2128
I prefer this one; it captures more edge cases:
(?:url\(|<(?:link|script|img)[^>]+(?:src|href)\s*=\s*)(?!['"]?(?:data|http))['"]?([^'"\)\s>]+)
Source: https://www.regextester.com/94254
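Reading the pattern, it appears to target raw HTML/CSS text (url(...) references and link, script and img tags) rather than a single attribute value, so it is probably not a drop-in for href=re.compile(...). A minimal sketch of how it behaves, assuming a made-up HTML snippet (the markup and the name ASSET_REF are mine, not from the source):

```python
import re

# The pattern from this answer, unchanged. A triple-quoted raw string is used
# because the pattern contains both quote characters.
ASSET_REF = re.compile(
    r"""(?:url\(|<(?:link|script|img)[^>]+(?:src|href)\s*=\s*)(?!['"]?(?:data|http))['"]?([^'"\)\s>]+)"""
)

# Hypothetical markup, for illustration only.
html = """
<link rel="stylesheet" href="css/site.css">
<script src="https://cdn.example.com/lib.js"></script>
<img src='../images/logo.png'>
<a href="about.html">About</a>
<div style="background: url(img/bg.png)"></div>
"""

# findall returns the single capture group: the referenced path.
found = ASSET_REF.findall(html)
print(found)
```

In this sketch the https script is skipped by the look-ahead, and the a tag is not picked up at all, since only link, script and img are listed in the pattern.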
Upvotes: 0
Reputation: 626903
Since you find it helpful, I am posting my suggestion.
The regular expression can be:
^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*
Note that it becomes more and more unreadable as you add exclusions or alternatives. Thus, perhaps, use VERBOSE mode (declared with re.X):
import re

p = re.compile(r"""
    ^                        # At the start of the string, ...
    (?!                      # check that the next characters are not...
        www\.                # URLs starting with www.
      |
        (?:http|ftp)s?://    # URLs starting with http, https, ftp, ftps
      |
        [A-Za-z]:\\          # Local full paths: drive letter, colon, backslash
      |
        //                   # Protocol-relative URLs or UNC-like locations starting with //
    )                        # End of look-ahead check
    .*                       # Match up to the end of the string
    """, re.X)
print(p.search("./about.html"))          # => There is a match
print(p.search("//dub-server1/mynode"))  # => No match
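As a quick check, here is a small sketch that runs the one-line pattern above against the test cases from the question. The absolute inputs and the helper name is_relative are my own additions for illustration, not part of the question.

```python
import re

# Same pattern as the one-line version above.
RELATIVE = re.compile(r"^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*")

# Test cases taken from the question.
relative_cases = [
    "about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
    "../", "../experts/", "../../../", "./", "./about.html",
]

# Hypothetical absolute inputs (assumed for this sketch).
absolute_cases = [
    "http://example.com/", "https://example.com/a.html",
    "ftp://example.com/file", "//example.com/lib.js",
    "www.example.com", "C:\\docs\\index.html",
]

def is_relative(url):
    """Return True if the negative look-ahead lets the string through."""
    return RELATIVE.match(url) is not None

for url in relative_cases + absolute_cases:
    print(f"{url!r}: {'relative' if is_relative(url) else 'not relative'}")
```

One caveat with this approach: only the listed prefixes are excluded, so a string with another scheme, such as mailto:someone@example.com, still passes as relative unless you add it to the look-ahead.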
As for Washington Guedes's regexes:
^([a-z0-9]*:|.{0})\/\/.*$ - matches:
^ - beginning of the string
([a-z0-9]*:|.{0}) - 2 alternatives:
    [a-z0-9]*: - 0 or more letters or digits followed by :
    .{0} - an empty string
\/\/.* - // and 0 or more characters other than a newline (note you do not need to escape / in Python)
$ - end of string
So, you can rewrite it as ^(?:[a-z0-9]*:)?//.*$. The i flag should be used with this regex.
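A minimal sketch to compare the original and the rewritten pattern on a few inputs. The sample strings and variable names are assumptions of mine, and agreement on a handful of samples is only a sanity check, not a proof.

```python
import re

# Original pattern and the simplified rewrite, both case-insensitive.
ORIGINAL = re.compile(r"^([a-z0-9]*:|.{0})\/\/.*$", re.I)
SIMPLIFIED = re.compile(r"^(?:[a-z0-9]*:)?//.*$", re.I)

# Hypothetical sample inputs.
samples = [
    "http://example.com", "HTTPS://example.com/a", "//example.com/x.js",
    "about.html", "/experts/", "../", "mailto:someone@example.com",
]

# Both patterns should give the same verdict on every sample.
for s in samples:
    assert bool(ORIGINAL.match(s)) == bool(SIMPLIFIED.match(s)), s

absolute = [s for s in samples if SIMPLIFIED.match(s)]
print(absolute)
```

Note that this pattern requires // after the optional scheme, so mailto:someone@example.com is not reported as absolute here.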
^[^\/]+\/[^\/].*$|^\/[^\/].*$ - is not optimal and has 2 alternatives:
Alternative 1:
^ - start of string
[^\/]+ - 1 or more characters other than /
\/ - literal /
[^\/].*$ - a character other than / followed by any 0 or more characters other than a newline
Alternative 2:
^ - start of string
\/ - literal /
[^\/].*$ - a character other than / followed by any 0 or more characters other than a newline up to the end of string.
It is clear that the whole regex can be shortened to ^[^/]*/[^/].*$. The i option can safely be removed from the regex flags.
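To see how the shortened pattern behaves, here is a small sketch that runs both versions against the test cases from the question (variable names are mine):

```python
import re

# Two-alternative original and the shortened equivalent; no flags needed.
ORIGINAL = re.compile(r"^[^\/]+\/[^\/].*$|^\/[^\/].*$")
SHORTENED = re.compile(r"^[^/]*/[^/].*$")

# Test cases taken from the question.
cases = [
    "about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
    "../", "../experts/", "../../../", "./", "./about.html",
]

# Check that both patterns agree, then split the cases by outcome.
agree = all(bool(ORIGINAL.match(c)) == bool(SHORTENED.match(c)) for c in cases)
matched = [c for c in cases if SHORTENED.match(c)]
missed = [c for c in cases if not SHORTENED.match(c)]

print("patterns agree:", agree)
print("matched:", matched)
print("missed:", missed)
```

Both versions require a / followed by at least one non-slash character, so in this sketch about.html, tutorial1/, /, ../ and ./ are not matched. Keep that in mind if you need all of the question's test cases to count as relative.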
Upvotes: 11
Reputation:
To match absolutes:
/^([a-z0-9]*:|.{0})\/\/.*$/gmi
And to match relatives:
/^[^\/]+\/[^\/].*$|^\/[^\/].*$/gmi
Upvotes: 2