Interpreting a Relative path in a URL

Question

I'm writing a web crawler in Python that takes a URL and does a depth-first search, following links down to some limited depth. The problem I'm having is interpreting relative paths in URLs.

On the page http://learnyouahaskell.com/introduction/, have a look at the "Starting Out" link, which uses a relative path. How can I determine whether this link refers to "http://learnyouahaskell.com/introduction/starting-out" or "http://learnyouahaskell.com/starting-out"? The second one is correct according to my browser.

Yet on the page http://math.colgate.edu/~mionescu/math399s11/ there is a link here which resolves to "http://math.colgate.edu/~mionescu/math399s11/Finalprojects.pdf".

Can someone explain this inconsistency to me? How can I determine how these paths should be resolved in my crawler?

iivel · Accepted Answer

The reason for this apparent inconsistency is that the learnyouahaskell site uses the <base> tag in its source. This directs all domainless hrefs to use the base as their starting point.

Without the base tag it would have appeared as expected (the first link you post) and acted just like the math.colgate.edu link.
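For the crawler itself, here is a minimal sketch of that rule using only the Python standard library. It assumes each page is fetched as an HTML string. The names LinkExtractor and resolve_links are hypothetical helpers invented for this example. It looks for a <base href> first and falls back to the page's own URL when none is present, then resolves each link with urllib.parse.urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects <a href> values and the first <base href>."""

    def __init__(self):
        super().__init__()
        self.base_href = None  # stays None when the page has no <base> tag
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base" and self.base_href is None:
            # Only the first <base href> on a page is honoured.
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute URLs for every <a href> found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    # A <base href> may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]
```

Called with the page URL http://learnyouahaskell.com/introduction/ and markup containing a <base> pointing at the site root, a link to "starting-out" would resolve against the root rather than against /introduction/. The same call on a page with no <base> tag resolves against the page's own directory, as with the math.colgate.edu link. The exact href values on those pages are assumptions here, not copied from their source.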
