Dušan Rychnovský

Reputation: 12459

How to resolve canonical URLs of web page links

Background information:

I am trying to build a very simple web crawler in Groovy. Given a single URL, it should download the associated web page and all pages linked from that page.

In the links in the HTML code, the URL addresses are sometimes abbreviated. Three different URL types come to mind:

I am however aware of the fact that web applications can implement arbitrary URL routing and that the URL addresses therefore might not reflect the structure of the filesystem at all.

My question:

How does the web browser know which URL to ask for when the user clicks a link in a web page? Or how would my crawler know which web page to download when it finds a link in a web page?

Any hints on available Groovy libraries to resolve URLs would also be appreciated.

Upvotes: 0

Views: 291

Answers (2)

Dušan Rychnovský

Reputation: 12459

The java.net.URI class, which is available in the standard library, provides means to resolve relative references via the URI#resolve(String) method.

See javadoc documentation.
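As a minimal sketch of how this could look in a crawler (the class name `LinkResolver`, the helper `resolve`, and the example.com URLs are made up for illustration), the page URL serves as the base and each `href` value is resolved against it:

```java
import java.net.URI;

class LinkResolver {
    // Resolves an href value found in a page against the URL of that page.
    // Hypothetical helper; only URI.create and URI#resolve come from the JDK.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/fruit/apple.html"; // assumed page URL

        // Relative to the current "directory" of the page
        System.out.println(resolve(page, "orange.html"));
        // Relative to the server root
        System.out.println(resolve(page, "/fruit/orange.html"));
        // Parent-directory reference
        System.out.println(resolve(page, "../index.html"));
        // Already absolute, returned unchanged
        System.out.println(resolve(page, "http://other.example/pear.html"));
    }
}
```

The first two calls both yield `http://example.com/fruit/orange.html`. Note that `URI.create` throws an `IllegalArgumentException` for strings that are not valid URIs (for example an `href` containing an unescaped space), so a real crawler would probably want to catch that. Since Groovy runs on the JVM, the same calls work from Groovy code.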

Upvotes: 1

Jukka K. Korpela

Reputation: 201628

Browsers resolve relative URLs (including URLs relative to server root, such as /fruit/orange.html) according to URL specifications, see Internet-standard STD 66, which is currently RFC 3986. In addition to general considerations, they need to take into account <base href=...> tags if present.
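To illustrate the `<base href=...>` point, here is a rough sketch in Java using `java.net.URI`. The class `BaseAwareResolver`, its `resolve` method, and the URLs are assumptions made for the example, and extracting the `base` value from the HTML is left out. The idea is that the base `href`, if present, is itself resolved against the document URL, and links are then resolved against the result:

```java
import java.net.URI;

class BaseAwareResolver {
    // documentUrl: the URL the page was fetched from
    // baseHref:    the href of the page's <base> tag, or null if there is none
    // href:        the link to resolve
    static String resolve(String documentUrl, String baseHref, String href) {
        URI document = URI.create(documentUrl);
        // A <base href> may itself be relative, so resolve it against the document URL first.
        URI base = (baseHref == null) ? document : document.resolve(baseHref);
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        String doc = "http://example.com/fruit/apple.html"; // assumed document URL

        // No <base> tag: the document URL is the base
        System.out.println(resolve(doc, null, "carrot.html"));
        // With <base href="/vegetables/">: links resolve under /vegetables/
        System.out.println(resolve(doc, "/vegetables/", "carrot.html"));
    }
}
```

Without a `<base>` tag the link becomes `http://example.com/fruit/carrot.html`; with `<base href="/vegetables/">` it becomes `http://example.com/vegetables/carrot.html`.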

This has nothing to do with file systems. If a URL happens to get mapped to a file in a server, that’s internal to the server.

Canonical URLs are something different. Using a link element with rel=canonical, a page may specify its canonical URL, which is the URL that search engines, for example, should use for the page. See e.g. http://googlewebmastercentral.blogspot.fi/2009/02/specify-your-canonical.html

Upvotes: 2
