alexgophermix

Reputation: 4279

Absolute URL incorrect when converted from relative URL Android JSoup

I'm trying to parse out navigation links from various sites.

I've been having issues with one particular site, which uses relative links prefixed with ./. Here is the code snippet, with the relevant parameter values in comments:

// url = http://megatokyo.com/strip/1456
// selector = ".next a"
// ele = <a href="./strip/1457">Next</a>
// attr = "href"
Element ele = doc.select(selector).get(index);
ele.setBaseUri(url);
String absoluteUrl = ele.absUrl(attr).trim().replaceAll("\n", "");

Jsoup returns:

http://megatokyo.com/strip/strip/1457

when in fact the real link is:

http://megatokyo.com/strip/1457

From my understanding, Jsoup is giving the correct link here, since ./ refers to the current directory (http://megatokyo.com/strip/), which would mean the anchor on the site is wrong. However, Chrome, Firefox and IE all resolve the relative URL to the next strip instead of /strip/strip/1457. Is there any way I can correct for this behaviour without breaking relative URLs in other cases?

Upvotes: 2

Views: 608

Answers (1)

Frederic Klein

Reputation: 2875

The problem:

If you look at the <head> of the HTML source, you will find:

<head>
    ...
    <base href="http://megatokyo.com/" />
</head>

What does it mean?

This is used as the base for every relative URL in the document, so ./ refers to http://megatokyo.com/ rather than to the directory of the current page. See: http://www.w3schools.com/tags/tag_base.asp
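To make the difference concrete, here is a minimal sketch that resolves the same relative link against both candidate bases. It uses only java.net.URI from the standard library, not Jsoup, so it illustrates the resolution rule rather than Jsoup's internals; the class and method names (BaseUriDemo, resolve) are made up for this example.

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
final class BaseUriDemo {

    // Resolve a relative reference against a base URL using standard URI resolution.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String relative = "./strip/1457";

        // Base = the page URL: "./" means the page's directory, /strip/.
        System.out.println(resolve("http://megatokyo.com/strip/1456", relative));

        // Base = the <base href> value: "./" means the site root.
        System.out.println(resolve("http://megatokyo.com/", relative));
    }
}
```

The first call yields the doubled /strip/strip/1457 path from the question; the second yields the link the browsers follow, because they honour the <base> element.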

Fix:

Jsoup already detects the <base> tag, so ele.absUrl("href") returns http://megatokyo.com/strip/1457 on its own (I just tested it). You are overriding the correct base with ele.setBaseUri(url);, so remove that line.

If you want to handle setting the correct base yourself, just parse the head for a <base> element:

String url = "http://megatokyo.com/strip/1456";

// Prefer the document's <base href>; fall back to the page URL if there is none.
Element base = doc.select("head > base[href]").first();
String baseUrl = base != null ? base.attr("href") : url;

Element ele = doc.select("#comic > div > div.navcontrols.top > ul > li.next > a").first();
ele.setBaseUri(baseUrl);

// "abs:href" is shorthand for ele.absUrl("href").
System.out.println(ele.attr("abs:href"));

Upvotes: 2
