I've found lots of answers on the server side for the relative-path-with-trailing-slash question, but none on the client side. Help me out here.
I'm writing a web crawler to take statistics on a set of websites, and am running into a problem. One website I'm working with has a navbar with relative paths with trailing slashes, and intends those paths to be treated as absolute, like so:
on page http://www.example.com/foo/bar
navbar link addresses -> foo/, baz/, quox/
intended absolute URLs -> http://www.example.com/foo/, http://www.example.com/baz/, http://www.example.com/quox/
The problem is that, as far as I can tell, this is nonstandard behavior, and yet Firefox and Chrome both handle those paths as absolute. According to RFC 1808 and RFC 2396, these should be resolved as relative paths, like this:
spec-correct absolute URLs -> http://www.example.com/foo/foo/, http://www.example.com/foo/baz/, http://www.example.com/foo/quox/
In particular, in section 5.1 of RFC 1808 and appendix C.1 of RFC 2396, the 4th example (g/) shows this exact case being treated as a relative path. In Ruby, which I'm writing the crawler in, the Addressable gem handles these according to spec.
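To make the spec behavior concrete, here is a minimal sketch that resolves the navbar links against the page URL. It uses `URI.join` from Ruby's standard `uri` library as a stand-in for Addressable (an assumption made so the snippet needs no gems); the variable names are just for illustration.

```ruby
require "uri"

# Page the navbar appears on, and the relative hrefs it contains.
base  = "http://www.example.com/foo/bar"
links = ["foo/", "baz/", "quox/"]

# URI.join drops the last path segment of the base ("bar") and
# appends the relative reference, as the RFC examples describe.
resolved = links.map { |href| URI.join(base, href).to_s }

puts resolved
```

Each link resolves under /foo/, not under the site root, which is the spec-correct result listed above.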
What's worse is the server in question is happy to return 200 OK for these paths, and all of them have this navbar: so I end up crawling http://www.example.com/foo/ which is the same page as http://www.example.com/foo/foo/, http://www.example.com/foo/foo/foo/ and so on, combinatorially to bizarre URLs like http://www.example.com/foo/baz/quox/foo/
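A small sketch of how the crawl runs away, assuming each fetched page carries the same navbar and each link is resolved per spec against the page it was found on. Again this uses the standard library's `URI.join` rather than Addressable, and the hop sequence is just one hypothetical path through the navbar.

```ruby
require "uri"

start = "http://www.example.com/foo/"
hops  = ["baz/", "quox/", "foo/"]  # one possible sequence of navbar clicks

# Resolve each hop against the URL reached by the previous hop.
trail = hops.reduce([start]) do |seen, href|
  seen + [URI.join(seen.last, href).to_s]
end

puts trail
```

Because every base URL here ends in a slash, each hop just appends another segment, so the set of reachable URLs never closes.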
So here's the question: Am I missing something that allows Chrome and Firefox to both interpret these URLs as absolute paths? Is there any way to tell apart the case where spec-style relative resolution is correct from the case where the absolute path is what is intended?