ernie

Reputation: 3

regex to match only .gov tlds

I am trying to write a regex to grab an entire url of any .gov or .edu web address to make it into a link.

I currently have:

/(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/

It is all wrapped in () so I can reuse the captured match for any URL, but I only want .gov or .edu ones.

Thanks in advance.

Upvotes: 0

Views: 688

Answers (1)

Donald Miner

Reputation: 39913

[-A-Z0-9+&@#\/%?=~_|!:,.;]* appears to be slurping up most of the url, so we need to jam the .gov and .edu in here somewhere. The quickest solution would be:

[-A-Z0-9+&@#\/%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*

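However, this will also match a URL where .gov only appears in the path, not the host, such as: http://www.example.com/evil.gov/test.html

To fix this, we can remove the / from the character class that precedes the top-level domain, so the match cannot run from the host into the path: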

[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*
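A quick way to see the difference between the two patterns above is to run both against the problem URL. This is only a sketch: it assumes JavaScript (the thread never names the language), adds the `i` flag because the character classes list only A-Z, and the names `loose`, `strict`, `evil`, and `good` are made up for illustration.

```javascript
// Assumption: JavaScript with the case-insensitive flag; variable names are hypothetical.

// First attempt: "/" is still allowed before the TLD.
const loose =
  /\b(?:https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*/i;

// Second attempt: "/" removed from the class that precedes the TLD.
const strict =
  /\b(?:https?|ftp):\/\/[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*/i;

const evil = "http://www.example.com/evil.gov/test.html";
const good = "http://www.example.gov/test.html";

console.log(loose.test(evil));  // true  (.gov found in the path)
console.log(strict.test(evil)); // false (the host part cannot contain "/")
console.log(strict.test(good)); // true
```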

Putting it all together, we have:

/(\b(https?|ftp):\/\/[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]?)/

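Without further changes this would not match a bare http://example.gov, because the last character class demands at least one more character after the TLD, so I added a ? to make that last token optional. Note that the character classes only list A-Z, so the pattern needs the case-insensitive flag to match lowercase URLs.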

Damn that is ugly.
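One caveat with the final pattern: it still accepts hosts where .gov or .edu is not the last label (for example http://evil.gov.example.com/), and its host class allows ? and #. Below is a rough sketch of a tighter variant wrapped in a linkifying helper. It assumes JavaScript, and `GOV_EDU_URL` and `linkifyGovEdu` are hypothetical names, not anything from the question. It also assumes the input text is already HTML-safe and does not handle userinfo (user@host) URLs.

```javascript
// Assumption: JavaScript; names are illustrative only.
// Host is limited to letters, digits, "-" and ".", and the lookahead
// rejects a TLD that is followed by more host characters.
const GOV_EDU_URL =
  /\b(?:https?|ftp):\/\/[-A-Z0-9.]+\.(?:gov|edu)(?![-A-Z0-9@]|\.[A-Z0-9])(?::\d+)?(?:[\/?#](?:[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])?)?/gi;

// Wrap every matching URL in an anchor tag; other text is left untouched.
function linkifyGovEdu(text) {
  return text.replace(GOV_EDU_URL, (url) => `<a href="${url}">${url}</a>`);
}

console.log(linkifyGovEdu("See http://www.example.gov/test.html."));
// See <a href="http://www.example.gov/test.html">http://www.example.gov/test.html</a>.
console.log(linkifyGovEdu("http://evil.gov.example.com/"));
// http://evil.gov.example.com/
```

The optional path group keeps the question's trick of ending on a non-punctuation character, so a sentence-ending period is left outside the link.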

Upvotes: 1
