JAL

Reputation: 21563

What is the best way to filter URLs for input?

I have a form that is accepting URLs from users in PHP.

What characters should I allow or disallow? Currently I use:

// '~' is used as the regex delimiter so that '/' can appear unescaped in the class
$input = preg_replace('~[^a-zA-Z0-9\-?:#.(),/&\'"]~', '', $string);

$input = substr($input, 0, 255);

So it's trimmed to 255 characters and can only include letters, numbers, and ? - _ : # ( ) , & ' " /

Anything I should be stripping that I'm not, or anything I'm stripping that might need to be in a valid URL?

Upvotes: 3

Views: 1272

Answers (5)

Hugo Nicolau

Reputation: 1

Nowadays there's <input type="url">, which can be used for simpler applications and maybe complex ones too.

Upvotes: 0

Mike Boers

Reputation: 6745

I would suggest parsing the URI according to the specs (being somewhat lenient about illegal characters) and then rebuilding it strictly according to the specs. That sounds like a lot of work, but I have a head start with a class I wrote and use for my own projects.

I have put it on pastebin, because it is rather large.

Example:

$uri = new N_Uri('http://example.com/path/segments/with spaces?key=value');
echo $uri;

Prints out: http://example.com/path/segments/with%20spaces?key=value
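The N_Uri class itself is not shown here, so the following is only a minimal sketch of the same parse-then-rebuild idea using PHP's built-in parse_url() and rawurlencode(). The function name rebuild_url is hypothetical, and the sketch assumes an absolute URL; it drops any user:password part and only percent-encodes spaces in the query and fragment.

```php
<?php
// Hypothetical helper illustrating "parse leniently, rebuild strictly".
// This is NOT the N_Uri class from the answer.
function rebuild_url(string $url): ?string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // unparseable, or not an absolute URL
    }

    // Decode, then re-encode each path segment: stray characters such as
    // spaces get escaped, and existing %XX escapes are not double-encoded.
    $segments = explode('/', $parts['path'] ?? '');
    $path = implode('/', array_map(function ($segment) {
        return rawurlencode(rawurldecode($segment));
    }, $segments));

    $out = strtolower($parts['scheme']) . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $out .= ':' . $parts['port'];
    }
    $out .= $path;
    if (isset($parts['query'])) {
        // Assumption: only spaces are fixed up in the query string.
        $out .= '?' . str_replace(' ', '%20', $parts['query']);
    }
    if (isset($parts['fragment'])) {
        $out .= '#' . str_replace(' ', '%20', $parts['fragment']);
    }
    return $out;
}
```

Calling rebuild_url() on the example URL above should give the same encoded form, and input without a scheme and host returns null so the caller can reject it. Note that rawurlencode() is stricter than the specs require: it also escapes characters such as ( ) and + inside path segments.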

Upvotes: 1

David Z

Reputation: 131800

RFC 1738, which defines the URL specification, states that only the characters

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+.-

may be used within a URL scheme, and only the characters

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789$-_.+!*'(),;/?:@=&

may be used unencoded within the scheme-specific part of a URL. (;/?:@=&, if used unencoded, must be used for their "reserved purposes", but if you're just checking for invalid characters you don't need to worry about that). So if you want full generality, I'd check the URL against this regex:

"/^([a-zA-Z0-9+.-]+:\/\/)?([a-zA-Z0-9\$\-_\.\+\!\*'\(\),\;\/\?\:\@\=\&]+)$/"

(probably some of that escaping is not necessary). If you're only looking for HTTP URLs, (some of) the other answers should be fine.
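As a rough sketch of how such a check might be wired up in PHP, the function below tests a string against the character sets listed above. The name uses_rfc1738_chars is hypothetical. Two things here are assumptions beyond the regex in this answer: the pattern is anchored with ^ and $ so the whole string must match, and %XX escape sequences are accepted in addition to the unencoded characters.

```php
<?php
// Hypothetical helper: checks which characters are used, not whether the
// URL is well-formed or points anywhere.
function uses_rfc1738_chars(string $url): bool
{
    // Characters allowed in a scheme.
    $scheme = '[a-zA-Z0-9+.-]+';

    // Characters allowed unencoded in the scheme-specific part,
    // plus %XX escapes (an assumption added for this sketch).
    $rest = '(?:[a-zA-Z0-9$\-_.+!*\'(),;\/?:@=&]|%[0-9a-fA-F]{2})+';

    $pattern = '/^(?:' . $scheme . ':\/\/)?' . $rest . '$/';
    return preg_match($pattern, $url) === 1;
}
```

With this sketch, a URL containing a raw space or a double quote is rejected, while the same URL with %20 passes. A "#fragment" is also rejected, since # is not in the character list above.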

Upvotes: 6

vishvananda

Reputation: 519

You need to allow the = sign, and % for things like %20. The @ sign is also legal.

You can validate the URL with a regex like this:

/(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/

Also, I don't think parentheses and quotes are allowed in URLs either.

Upvotes: 3

user65952

Reputation: 202

This is the regex I used on a TinyUrl clone site I made:

([a-zA-Z]+://)?([a-z0-9A-Z-]+\.[a-z0-9A-Z\.-]+[a-z0-9A-Z/_?=;%&,+\.\-]+)
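The pattern is shown without delimiters, so here is one possible way to use it with preg_match() in PHP. This is a sketch under assumptions: ~ is chosen as the delimiter because the pattern contains unescaped / characters, ^ and $ anchors are added so the entire input must match, and the function name looks_like_url is hypothetical.

```php
<?php
// Hypothetical wrapper around the pattern above.
function looks_like_url(string $input): bool
{
    // '~' delimiters avoid having to escape the '/' characters in the pattern.
    // The ^ and $ anchors are an addition for this sketch.
    $pattern = '~^([a-zA-Z]+://)?([a-z0-9A-Z-]+\.[a-z0-9A-Z\.-]+[a-z0-9A-Z/_?=;%&,+\.\-]+)$~';
    return preg_match($pattern, $input) === 1;
}
```

For example, looks_like_url('http://example.com/page?id=1') and looks_like_url('example.com') both pass, while input containing a space, or a host without a dot such as 'localhost', does not.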

Upvotes: 0
