Reputation: 639
I'm not a regex expert and I'm racking my brain over one that seems very simple and has to work in Python 2.7: validate the path of a URL (no hostname), without the query string. In other words, a string that starts with /, allows alphanumeric characters, and doesn't allow any other special characters except these: /, . and -
I found this post that is very similar to what I need, but it isn't working for me at all: I can test with, for example, aaa and it returns true even though it doesn't start with /.
The regex I currently have that kind of works is this one:
[^/+a-zA-Z0-9.-]
but it doesn't reject paths that don't start with /. For example:
/aaa -> true, this is ok
/aaa/bbb -> true, this is ok
/aaa?q=x -> false, this is ok
aaa -> true, this is NOT ok
Upvotes: 8
Views: 11023
Reputation: 174662
In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -
You are missing some characters that are valid in URLs, for example ~. Here is a version without a regex that parses out the path, decodes it, and checks it against a whitelist:
import string
import urllib
import urlparse

# Characters allowed in a path; ascii_letters avoids the locale-dependent string.letters
valid_chars = string.ascii_letters + string.digits + '/.-~'
valid_paths = []
urls = ['http://www.my.uni.edu/info/matriculation/enroling.html',
        'http://info.my.org/AboutUs/Phonebook',
        'http://www.library.my.town.va.us/Catalogue/76523471236%2Fwen44--4.98',
        'http://www.my.org/462F4F2D4241522A314159265358979323846',
        'http://www.myu.edu/org/admin/people#andy',
        'http://www.w3.org/RDB/EMP?*%20where%20name%%3Ddobbins']
for url in urls:
    # Keep only the path (no host, query or fragment) and decode %XX escapes
    path = urllib.unquote(urlparse.urlparse(url).path)
    # startswith() is safe on an empty path, unlike path[0]
    if path.startswith('/') and all(c in valid_chars for c in path):
        valid_paths.append(path)
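For readers on Python 3, where urllib.unquote and the urlparse module moved into urllib.parse, the same check could be sketched roughly as below. The helper name is_valid_path and the constant VALID_CHARS are hypothetical names introduced for illustration, not part of the answer above.

```python
import string

try:
    from urllib.parse import unquote, urlparse  # Python 3 location
except ImportError:
    from urllib import unquote                  # Python 2.7 location
    from urlparse import urlparse

# Same whitelist as in the answer: ASCII letters, digits, and / . - ~
VALID_CHARS = set(string.ascii_letters + string.digits + '/.-~')


def is_valid_path(url):
    """Hypothetical helper: True if the decoded path of `url` starts
    with '/' and contains only whitelisted characters."""
    # urlparse drops the host, query string and fragment; unquote decodes %XX
    path = unquote(urlparse(url).path)
    return path.startswith('/') and all(c in VALID_CHARS for c in path)
```

Note that the path is decoded before it is checked, so an encoded slash such as %2F counts as an ordinary / here. A URL with no path at all (for example http://example.com) is rejected because its path is the empty string.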
Upvotes: 3
Reputation: 5936
Try posting some more code. I can't figure out from your question how you're using your regex. What's confusing me is that your expression [^/+a-zA-Z0-9.-] basically says:
Match a single character if it is:
not a / or a +
or a-z (upper and lower case) or 0-9
or a dot or a dash
It doesn't quite make sense to me without knowing how you use it, as it only matches a single character and not a whole URL string.
I'm not sure I understand why you cannot start with a /.
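To make the single-character point concrete, here is a small sketch of what that class does on its own. It assumes the pattern was used with re.search and the result negated, which the question does not confirm; the names FORBIDDEN and has_no_forbidden_char are hypothetical.

```python
import re

# The question's pattern: a negated class that matches ONE character
# that is not '/', '+', a letter, a digit, '.' or '-'.
FORBIDDEN = re.compile(r'[^/+a-zA-Z0-9.-]')


def has_no_forbidden_char(s):
    # Assumed usage: the string is "valid" when no forbidden
    # character is found anywhere in it.
    return FORBIDDEN.search(s) is None
```

Under that assumption, aaa passes simply because it contains no forbidden character: nothing in the class says where the string has to start.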
Upvotes: 0
Reputation: 30283
The regex you've defined is a character class. Instead, try:
^\/[/.a-zA-Z0-9-]+$
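As a quick check, the anchored pattern can be tried against the four inputs from the question. This is only a sketch; the names PATH_RE and is_path are made up for the example.

```python
import re

# '^' and '$' anchor the match to the whole string; the first character
# must be '/', followed by one or more allowed characters.
PATH_RE = re.compile(r'^\/[/.a-zA-Z0-9-]+$')


def is_path(s):
    # Hypothetical helper: True only when the entire string fits the pattern
    return PATH_RE.match(s) is not None
```

With this pattern /aaa and /aaa/bbb are accepted, while /aaa?q=x and aaa are rejected. Because of the +, a bare / is rejected as well; changing + to * would accept it.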
Upvotes: 6
Reputation: 9776
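Try this:
^(?:/[a-zA-Z0-9.-]*)+$
The form with a character-class intersection, ^(?:/[a-zA-Z0-9.-&&[^/]]*)+$, is Java syntax; Python's re module does not support && and rejects that pattern with a "bad character range" error. The intersection is redundant anyway, since the class does not contain /.
Seems to work.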
Upvotes: 0