Reputation: 4279
I'm trying to parse out navigation links from various sites.
I've been having issues with one particular site which uses a relative format prefixed with ./
Here is the code snippet with relevant param values in comments:
// url = http://megatokyo.com/strip/1456
// selector = ".next a"
// ele = <a href="./strip/1457">Next</a>
// attr = "href"
Element ele = doc.select(selector).get(index);
ele.setBaseUri(url);
String absoluteUrl = ele.absUrl(attr).trim().replaceAll("\n", "");
Jsoup returns:
http://megatokyo.com/strip/strip/1457
when in fact the real link is:
http://megatokyo.com/strip/1457
From my understanding, Jsoup is giving the correct link here, as ./ refers to the current directory (http://megatokyo.com/strip/), meaning that the anchor is done incorrectly on the site. However, Chrome, Firefox and IE all resolve the relative URL to point to the next strip instead of /strip/strip/1457. Is there any way I can correct for this behaviour without breaking relative URLs in other cases?
Upvotes: 2
Views: 608
Reputation: 2875
The problem:
If you look at the <head> of the HTML source, you will find:
<head>
...
<base href="http://megatokyo.com/" />
</head>
What does it mean?
For all relative URLs in the document, this is used as the base, so the current directory ./ is http://megatokyo.com/ rather than the directory of the page's own URL. See: http://www.w3schools.com/tags/tag_base.asp
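To make the effect of the base concrete, here is a minimal sketch that resolves the same relative link against both candidate bases. It uses only java.net.URI from the standard library, not Jsoup itself, and the class name BaseResolveDemo and the helper resolve are made up for illustration.

```java
import java.net.URI;

// Hypothetical demo class; uses java.net.URI rather than Jsoup itself.
class BaseResolveDemo {

    // Resolves a relative reference against a base URL.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String relative = "./strip/1457";
        // Base = the page's own URL, so "./" means /strip/
        System.out.println(resolve("http://megatokyo.com/strip/1456", relative));
        // Base = the value of <base href>, so "./" means /
        System.out.println(resolve("http://megatokyo.com/", relative));
    }
}
```

The first call should reproduce the /strip/strip/1457 URL from the question, and the second the link the browsers follow.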
Fix:
Jsoup already detects the <base> tag, and ele.absUrl("href") would (and does, I just tested it) return http://megatokyo.com/strip/1457, but you are overriding the correct base with ele.setBaseUri(url); so remove that line of code.
If you want to handle setting the correct base yourself, just parse the head for a <base> element:
String url = "http://megatokyo.com/strip/1456";
// Prefer the document's <base href>, falling back to the page URL.
Element base = doc.select("head > base[href]").first();
String baseUrl = base != null ? base.attr("href") : url;
Element ele = doc.select("#comic > div > div.navcontrols.top > ul > li.next > a").first();
// Resolve the link's href against the chosen base.
ele.setBaseUri(baseUrl);
System.out.println(ele.attr("abs:href"));
Upvotes: 2