Reputation: 308
So I've been playing with writing some web crawlers and testing them on different sites. But I've come across some sites that seem like their relative urls should not work, or at least I think they should point to someplace other than where the browser resolves them to.
Given a url of a current page : "http://www.examplesite.com/a/page.htm" And a link of: "a/page2.htm"
The browser correctly resolves this as: "http://www.examplesite.com/a/page2.htm"
My problem/feeling (obviously wrong, but I'm wondering why) is that this should resolve to "http://www.examplesite.com/a/a/page2.htm". The relative url does not begin with a /, so why does it become base relative?
Interestingly, Java's URL class appears to agree with me, as the following code will output : "http://www.examplesite.com/a/a/page2.htm"
URL baseUrl = new URL("http://www.examplesite.com/a/page.htm");
URL absoluteURL = new URL(baseURL,"a/page2.htm");
Why does this link resolve the way it does, and what is the formal rule for resolving a relative link like this?
EDIT:
I just notice that in the <head>
portion of the webpage there is a field like so:
<base href="http://examplesite.com/">
I'm assuming that this overrides any relative links to use that as its base url instead of the actual url. Is this a correct assumption? Is that even a valid html markup?
Upvotes: 1
Views: 109
Reputation: 4778
The site is likely using a <base>
tag to specify the parent as the prefix to all relative URL's on the site.
You can find out more on the base tag here. If this is not the case, then please provide the source URL as this defies normal behavior.
Upvotes: 3
Reputation: 15797
You are correct in that it is the base
tag, and yes it is valid.
In HTML, links and references to external images, applets, form-processing programs, style sheets, etc. are always specified by a URI. Relative URIs are resolved according to a base URI, which may come from a variety of sources. The BASE element allows authors to specify a document's base URI explicitly.
When present, the BASE element must appear in the HEAD section of an HTML document, before any element that refers to an external source. The path information specified by the BASE element only affects URIs in the document where the element appears.
Sources: W3C Wiki and W3C Markup
Upvotes: 4