aardvarkk
aardvarkk

Reputation: 15996

A method for standardizing URI/URLs in Ruby in a forgiving manner

I'm trying to find a method to take a URI/URL string from a user and determine a working, canonical form (or failing if the resource isn't valid). Simultaneously, it should also verify that the URL exists. So we're checking for both valid "syntax" and also existence.

For instance, a string like google.com should be turned into http://www.google.com, and a string like google.com/insights should be turned into http://www.google.com/insights. A string like http://thiswebsitedoesntexistatall.com should return some sort of error or exception.

I believe a portion of the solution may likely be calling an HTTP get_response() method and following redirects until I get a 200 OK status.

It seems like the URI.parse() method is not forgiving of leaving off the http. I realize I could write a simple thing to try adding http in front, etc., but I was hoping there was some existing gem or little-known library function that would be really forgiving about URLs and canonicalize them for me.

Both the built in net/http and HTTParty seem to be too strict for what I'm looking for. Is there a nice way to do this?

Upvotes: 4

Views: 672

Answers (1)

the Tin Man
the Tin Man

Reputation: 160631

There are some problems with what you're asking for:

  • A URL parser shouldn't assume the value passed in is HTTP, when FTP and many other protocols are equally valid. If you know the protocol is likely to be HTTP, then you need to add the protocol.
  • If you try to connect to a site and follow redirects until you get a 200 response, you've only proven that the URL resolves to a valid page of some sort. That 200 could be an error-page returned because the one you want is a dead link or invalid, or that the site is temporarily down. To prove otherwise means you have to have some intimate pre-knowledge about the page you're looking for, such as specific content to search for.
  • Assuming the URL is good after you follow the redirects, is not quite safe. Many sites add on all sorts of session data to the URL, so what could start as a simple and clean URL can resolve to a long and convoluted one.

I'd recommend you look at the Addressable::URI gem. It's much more full-featured than Ruby's URI. It won't make the decisions for you, but at least it will give you a more complete API and can rewrite/normalize URLs. Cleaning them up and/or determining if they are good is still left as an exercise for you.

Upvotes: 3

Related Questions