Isaiah Lee

Reputation: 687

Validate a string to be URL safe using regex

I have a site where users can pick a username. Currently, they can enter almost any character, including things such as @, !, #, etc.

I know I can use a regex, and that's probably what I'll go with.

I'll be using a negated set, which I'm assuming is the right tool here, like so:

[^@!#]

So, how can I know all of the illegal characters to put in that set? I can start manually putting in the ones that are obvious such as !@#$%^&*(), but is there an easy way to do this without manually putting every single one of them in?

I know a lot of sites only allow strings that contain letters, numbers, dashes, or underscores. Something like that would work well for me.

Any help would be greatly appreciated.

Thanks S.O.!

Upvotes: 18

Views: 17807

Answers (4)

CleverPatrick

Reputation: 9493

All the answers on this question seem to assume English. To allow Unicode characters (so people can have URLs / usernames in their native language), it is better to use a blacklist of reserved / unsafe characters than a whitelist of allowed characters.

Here is a regex that matches characters which are generally unsafe in a URL:

([&$\+,:;=\?@#\s<>\[\]\{\}\/|\\\^%])+

Link to test RegEx

(list based on unsafe characters mentioned in this answer)
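As a rough sketch of how this blacklist approach could be applied, here is a Python version. The question does not name a language, so Python is an assumption, and `UNSAFE_RE` and `has_unsafe_chars` are made-up names for illustration. The pattern covers the same characters as the list above, written as a single character class.

```python
import re

# Characters treated as unsafe in this sketch:
# & $ + , : ; = ? @ # whitespace < > [ ] { } / | \ ^ %
# (UNSAFE_RE and has_unsafe_chars are hypothetical names.)
UNSAFE_RE = re.compile(r"[&$+,:;=?@#\s<>\[\]{}/|\\^%]")

def has_unsafe_chars(name: str) -> bool:
    """Return True if the name contains at least one blacklisted character."""
    return UNSAFE_RE.search(name) is not None

print(has_unsafe_chars("jos\u00e9"))   # False: accented letters are not on the list
print(has_unsafe_chars("first last"))  # True: whitespace is on the list
print(has_unsafe_chars("a/b"))         # True: slash is on the list
```

Anything not on the list is accepted, including characters such as ! and *, so whether that is strict enough depends on where the username ends up in the URL.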

Upvotes: 2

hwnd

Reputation: 70732

Instead of using negation, place only what you want to allow inside of your character class.

^[a-zA-Z0-9_-]*$

Explanation:

^                 # the beginning of the string
 [a-zA-Z0-9_-]*   #  any character of: 'a' to 'z', 'A' to 'Z', 
                  #  '0' to '9', '_', '-' (0 or more times)
$                 # before an optional \n, and the end of the string
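A minimal sketch of this check in Python, assuming Python since the question does not say which language is in use. `USERNAME_RE` and `is_valid_username` are hypothetical names. Two deliberate differences from the pattern above: `+` is used instead of `*` so that an empty username is rejected, and `re.fullmatch` is used instead of `^...$` so that a trailing newline is not accepted.

```python
import re

# Same allowed set as the answer: letters, digits, underscore, hyphen.
# "+" (one or more) is an assumption here; the answer's "*" also allows "".
USERNAME_RE = re.compile(r"[a-zA-Z0-9_-]+")

def is_valid_username(name: str) -> bool:
    """Return True only if every character of name is in the allowed set."""
    # fullmatch requires the whole string to match, including any trailing "\n".
    return USERNAME_RE.fullmatch(name) is not None

print(is_valid_username("isaiah_lee-42"))  # True
print(is_valid_username("bad@name!"))      # False
print(is_valid_username("abc\n"))          # False
```

The trailing newline case matters because, as the explanation notes, `$` also matches just before a final `\n`, so `re.match(r"^[a-zA-Z0-9_-]*$", "abc\n")` would succeed in Python.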

Upvotes: 33

OnlineCop

Reputation: 4069

One of the reasons you'll want to use an inclusive set is that limiting bad characters is very difficult with all the Unicode variants out there. Characters such as ß, ñ, oœ, æ will probably give you a headache. If you limit the username to just a subset of letters that YOU provide, you can easily chop out everything else that you may not want in there.

Upvotes: 2

brunofitas

Reputation: 3083

Instead of denying values, maybe it's better to only allow some.

[[:word:]] -- Digits, letters and underscore (a POSIX-style class where the engine supports it; \w is the usual shorthand)

Check this chart

http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
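Here is a small sketch of the word-character approach in Python, again assuming Python since no language is given. Python's `re` module does not support POSIX bracket classes, so the sketch uses `\w` instead. `is_word_only` is a hypothetical helper name. Note that in Python 3 `\w` matches Unicode letters and digits by default, and the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`.

```python
import re

# \w with re.ASCII: only a-z, A-Z, 0-9 and underscore.
WORD_ASCII = re.compile(r"\w+", re.ASCII)
# \w without flags: also letters and digits from other scripts.
WORD_UNICODE = re.compile(r"\w+")

def is_word_only(name: str, ascii_only: bool = True) -> bool:
    """Return True if name consists solely of word characters."""
    pattern = WORD_ASCII if ascii_only else WORD_UNICODE
    return pattern.fullmatch(name) is not None

print(is_word_only("user_1"))                        # True
print(is_word_only("jos\u00e9"))                     # False: not ASCII
print(is_word_only("jos\u00e9", ascii_only=False))   # True
print(is_word_only("a-b"))                           # False: hyphen is not a word character
```

Hyphens are not word characters, so a site that wants to allow dashes would need to add `-` to the set explicitly.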

Upvotes: 3
