jjoelson

Reputation: 5941

Interpreting a Relative path in a URL

I'm writing a web crawler in Python that takes a URL and does a depth-first search, following links down to some limited depth. The problem I'm having is interpreting relative paths in URLs.

On the page http://learnyouahaskell.com/introduction/ have a look at the "Starting Out" link; it looks like <a href="starting-out" class="nxtlink">Starting Out</a>. How can I determine whether this link refers to "http://learnyouahaskell.com/introduction/starting-out" or "http://learnyouahaskell.com/starting-out"? The second one is correct according to my browser.

Yet on the page http://math.colgate.edu/~mionescu/math399s11/ there is a link <a href="Finalprojects.pdf">here</a> which resolves to "http://math.colgate.edu/~mionescu/math399s11/Finalprojects.pdf".

Can someone explain this inconsistency to me? How can I determine how these paths should be resolved in my crawler?

Upvotes: 1

Views: 352

Answers (1)

iivel

Reputation: 2576

The reason for this apparent inconsistency is that the learnyouahaskell site uses the <base href=""> tag in its source. That tag makes every relative href on the page resolve against the base URL instead of the URL of the page itself.

Without the base tag, the link would have resolved as you expected (the first URL you posted), just like the math.colgate.edu link.
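For the crawler, one way to handle both cases is to look for a <base> tag before resolving each link. Below is a minimal sketch using only the standard library (html.parser and urllib.parse.urljoin). The names BaseHrefFinder and resolve_link are hypothetical, and the base value "http://learnyouahaskell.com/" in the usage lines is an assumption chosen to match what your browser shows, not something copied from the site's source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Hypothetical helper: remembers the href of the first <base> tag, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first <base href> counts.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def resolve_link(page_url, html, href):
    """Resolve href against the page's <base href> if present, else the page URL."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # The base href may itself be relative, so resolve it against the page URL.
        base = urljoin(page_url, finder.base_href)
    else:
        base = page_url
    return urljoin(base, href)


# Page without a <base> tag: resolves relative to the page's own URL.
print(resolve_link(
    "http://math.colgate.edu/~mionescu/math399s11/",
    '<a href="Finalprojects.pdf">here</a>',
    "Finalprojects.pdf",
))

# Page with a <base> tag (the base value here is assumed for illustration).
print(resolve_link(
    "http://learnyouahaskell.com/introduction/",
    '<head><base href="http://learnyouahaskell.com/"></head>',
    "starting-out",
))
```

With those inputs the first call should give the math399s11/Finalprojects.pdf URL and the second should give http://learnyouahaskell.com/starting-out. If you already parse pages with another HTML library, the same idea applies: find the base href first, then pass it to urljoin instead of the page URL.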

Upvotes: 3
