How to resolve canonical URLs of web page links

Question

Background information:

I am trying to build a very simple web crawler in Groovy. It would, given a single URL address, download the associated web page and all pages linked from that page.

In the links in the HTML code, the URL addresses are sometimes abbreviated. Three different URL types come to mind:

an absolute URL address (such as http://www.food.com/fruit/orange.html)
an absolute URL address relative to the web root (such as /fruit/orange.html)
a relative URL address, relative to the directory where the current web page resides (such as ../vegetables/carrot.html)

I am aware, however, that web applications can implement arbitrary URL routing, so the URL addresses might not reflect the structure of the filesystem at all.

My question:

How does the web browser know which URL to ask for when the user clicks a link in a web page? Or how would my crawler know which web page to download when it finds a link in a web page?

Any hints on available Groovy libraries to resolve URLs would also be appreciated.

Dušan Rychnovský

Accepted Answer

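A browser resolves each link against the URL of the page that contains it; this works purely on the URL text, so it does not matter how the server routes requests or lays out its filesystem. The java.net.URI class, which is available in the Java standard library and can be used directly from Groovy, resolves relative references against a base URI via the URI#resolve(String) method.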

See the Javadoc for java.net.URI for details.
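
As a rough illustration of the above, the sketch below resolves the three kinds of links from the question against the URL of the page they were found on. The class name LinkResolver, the helper method resolve, and the page URL http://www.food.com/fruit/apple.html are made up for this example; only java.net.URI and its create and resolve methods come from the standard library. It also assumes the page has no <base> element overriding its base URL. The code is plain Java and should also run as Groovy.

```java
import java.net.URI;

public class LinkResolver {

    // Resolves an href found in a page against the URL that page was fetched from.
    // Hypothetical helper: throws IllegalArgumentException if either string
    // is not a syntactically valid URI (for example, if it contains spaces).
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        return base.resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        // Assumed URL of the page being crawled.
        String page = "http://www.food.com/fruit/apple.html";

        // Absolute URL: returned unchanged.
        System.out.println(resolve(page, "http://www.food.com/fruit/orange.html"));
        // Path starting at the web root: scheme and host are taken from the page URL.
        System.out.println(resolve(page, "/fruit/orange.html"));
        // Relative path: resolved against the page's directory, ".." removed.
        System.out.println(resolve(page, "../vegetables/carrot.html"));
    }
}
```

The first two calls print http://www.food.com/fruit/orange.html and the third prints http://www.food.com/vegetables/carrot.html. Links scraped from real pages are often not valid URIs as written, so a crawler would probably want to catch IllegalArgumentException and skip or clean up such links.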
