Juancho

Reputation: 639

Regex: validate a URL path with no query params

I'm not a regex expert, and I'm struggling to write one that seems very simple and works in Python 2.7: validate the path of a URL (no hostname) without the query string. In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -

I found this post that is very similar to what I need, but it isn't working for me at all: I can test it with, for example, aaa, and it returns true even though the string doesn't start with /.

The current regex that I have kinda working is this one:

[^/+a-zA-Z0-9.-]

but it doesn't work with paths that don't start with /. For example:

Upvotes: 8

Views: 11023

Answers (4)

Burhan Khalid

Reputation: 174662

In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -

You are missing some characters that are valid in URLs, such as ~:

import string
import urllib
import urlparse

# Allowed path characters: ASCII letters, digits and '/', '.', '-', '~'
valid_chars = set(string.ascii_letters + string.digits + '/.-~')
valid_paths = []

urls = ['http://www.my.uni.edu/info/matriculation/enroling.html',
        'http://info.my.org/AboutUs/Phonebook',
        'http://www.library.my.town.va.us/Catalogue/76523471236%2Fwen44--4.98',
        'http://www.my.org/462F4F2D4241522A314159265358979323846',
        'http://www.myu.edu/org/admin/people#andy',
        'http://www.w3.org/RDB/EMP?*%20where%20name%%3Ddobbins']

for url in urls:
    # Keep only the path component (no host, query or fragment), then decode %XX escapes
    path = urllib.unquote(urlparse.urlparse(url).path)
    # startswith() also copes with an empty path, where path[0] would raise IndexError
    if path.startswith('/') and all(c in valid_chars for c in path):
        valid_paths.append(path)

Upvotes: 3

Morten Jensen

Reputation: 5936

Try posting some more code. I can't figure out how you're using your regex from your question. What's confusing me is, your re expression [^/+a-zA-Z0-9.-] basically says:

Match a single character if it is:

not a /, a +, a-z (both upper and lower case), 0-9, a dot or a dash

It doesn't quite make sense to me without knowing how you use it, as it only matches a single character and not a whole URL string.
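
To make that concrete, here is a minimal sketch of what the negated class does on its own. The helper name find_disallowed and the sample strings are made up for illustration; the question doesn't show how the regex is actually called, so this is only one plausible way to use it.

```python
import re

# The pattern from the question: a negated class that matches ONE character
# that is not '/', '+', a letter, a digit, '.' or '-'.
DISALLOWED = re.compile(r'[^/+a-zA-Z0-9.-]')

def find_disallowed(path):
    """Hypothetical helper: return the first disallowed character, or None."""
    m = DISALLOWED.search(path)
    return m.group(0) if m else None

print(find_disallowed('/docs/index.html'))  # nothing disallowed is found
print(find_disallowed('aaa'))               # same here: the leading '/' is never checked
print(find_disallowed('/docs?page=1'))      # the '?' is reported
```

Used this way, the class can only tell you that a forbidden character is present somewhere. It says nothing about where the string starts, which would explain why aaa slips through.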

I'm not sure I understand why you cannot start with a /.

Upvotes: 0

Andrew Cheong

Reputation: 30283

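The regex you've defined is just a single negated character class, so it matches one disallowed character rather than validating the whole string. Instead, try: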

^\/[/.a-zA-Z0-9-]+$
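
A rough sketch of how this pattern could be used from Python follows. The helper name is_valid_path and the sample paths are assumptions for illustration, not part of the question.

```python
import re

# Pattern from the answer above; the backslash before the first '/' is
# harmless in Python, where '/' needs no escaping.
PATH_RE = re.compile(r'^\/[/.a-zA-Z0-9-]+$')

def is_valid_path(path):
    """Hypothetical helper: True if the whole string is an allowed path."""
    return PATH_RE.match(path) is not None

print(is_valid_path('/info/matriculation/enroling.html'))  # accepted
print(is_valid_path('aaa'))                                 # rejected: no leading '/'
print(is_valid_path('/search?q=1'))                         # rejected: '?' and '=' not allowed
```

Two things to be aware of: because of the +, a bare / is rejected (use * instead if the root path should pass), and in Python $ also matches just before a trailing newline, so \Z may be a safer anchor for strict validation.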

Upvotes: 6

Gábor Lipták

Reputation: 9776

Try this:

^(?:/[a-zA-Z0-9.-&&[^/]]*)+$

Seems to work.
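
One caveat for Python users: the &&[^/] part is character class intersection, a Java regex feature that Python's re module does not support. Below is a sketch of what I assume is an equivalent pattern for Python, simply dropping the intersection since the class already excludes /. The helper name is_valid_path is hypothetical.

```python
import re

# Assumed Python equivalent of the pattern above: the Java-style '&&[^/]'
# intersection is dropped, because [a-zA-Z0-9.-] already excludes '/'.
SEGMENTS_RE = re.compile(r'^(?:/[a-zA-Z0-9.-]*)+$')

def is_valid_path(path):
    """Hypothetical helper: True if path is one or more '/segment' parts."""
    return SEGMENTS_RE.match(path) is not None

print(is_valid_path('/'))              # accepted: a single empty segment
print(is_valid_path('/a/b-c.html'))    # accepted
print(is_valid_path('aaa'))            # rejected: no leading '/'
```

Unlike a pattern that requires at least one character after the slash, this version also accepts the bare root path / and empty segments such as //.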

Upvotes: 0
