Thanks to stackoverflow.com, I have this:
Document doc = Jsoup.connect(urlFromUser).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36").timeout(0).get();
doc.absUrl(urlFromUser);
doc.setBaseUri(urlFromUser);
Elements elements = doc.select("body");
Elements imgElements = doc.select("img");
for (Element element : imgElements) {
element.attr("src", element.attr("abs:src"));
}
Elements hrefElements = doc.select("a");
for (Element element : hrefElements) {
element.attr("href", "http://www.some.com/translit/lat2cyr?" + element.attr("abs:href"));
}
Elements linkElements = doc.head().select("link");
for (Element element : linkElements) {
element.attr("href", element.attr("abs:href"));
writer.print("");
manipulateElements(elements);
}
The result is:
<link rel="stylesheet" href="css/windows/windows.css?">
But I need this:
<link rel="stylesheet" href="http://DOMAIN.com/css/windows/windows.css?">
I tried this but it doesn't solve the problem:
String host = uri.getHost();
host = "http://" + host;
writer.print(doc.toString().replaceAll("href=\"/css/", "href=\"" + host + "/css/").replaceAll("/jscript/", host + "/jscript/").replaceAll("/styles/", host + "/styles/").replaceAll("/functions/", host + "/functions/").replaceAll("href=\"/templates/", host + "/templates/").replaceAll("href=\"/plugins/", host + "/plugins/").replaceAll("href=\"css/", "href=\"" + host + "/css/"));
writer.close();
Reputation: 43033
To achieve your goal you would need a custom OuterHtmlVisitor that generates absolute URLs instead of relative ones. Unfortunately, as of JSoup 1.8.3, this class is internal.
You could write a custom NodeVisitor implementation instead, but that is a lot of work.
On the other hand, here is a workaround:
// Fetch the document. JSoup will set the baseUri for us automatically...
Document doc = Jsoup //
.connect(urlFromUser) //
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36") //
.timeout(0) //
.get();
// Turn any url into an absolute url
String myTargetedTags = "img, a, link";
for (Element e : doc.select(myTargetedTags)) {
    switch (e.tagName().toLowerCase()) {
        case "img":
            e.attr("src", e.absUrl("src"));
            break;
        case "a":
            e.attr("href", "http://www.some.com/translit/lat2cyr?" + e.absUrl("href"));
            break;
        case "link":
            e.attr("href", e.absUrl("href"));
            break;
        default:
            throw new RuntimeException("Unexpected element:\n" + e.outerHtml());
    }
}
// Print out the final result
writer.print(doc.outerHtml());
writer.flush(); // Just to be sure that everything goes out...
writer.close();
Note: For large documents, I don't know how this code performs.
SAMPLE CODE
// Parse the document from a string. Here the baseUri is passed explicitly as the second argument...
Document doc = Jsoup
.parse( //
"<html><head><link rel=\"stylesheet\" type=\"text/css\" href=\"/css/main.css\"></head><body><img src=\"img/my-image.jpg\"><a href=\"/page/page.html\">an anchor</a></body></html>", //
"http://localhost");
System.out.println("** BEFORE **\n" + doc.outerHtml());
// Turn any url into an absolute url
// (same lines as above...)
// Print out the final result
System.out.println("\n** AFTER **\n" + doc.outerHtml());
OUTPUT
** BEFORE **
<html>
<head>
<link rel="stylesheet" type="text/css" href="/css/main.css">
</head>
<body>
<img src="img/my-image.jpg">
<a href="/page/page.html">an anchor</a>
</body>
</html>
** AFTER **
<html>
<head>
<link rel="stylesheet" type="text/css" href="http://localhost/css/main.css">
</head>
<body>
<img src="http://localhost/img/my-image.jpg">
<a href="http://www.some.com/translit/lat2cyr?http://localhost/page/page.html">an anchor</a>
</body>
</html>
Tested on JSoup 1.8.3
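To see why resolving against the base URI is more robust than a chain of replaceAll calls, here is a minimal JDK-only sketch of the same kind of resolution. It is not JSoup's implementation, only an approximation using java.net.URI. The class name AbsUrlSketch and the helper toAbsolute are hypothetical names chosen for this illustration, and the base "http://localhost" is taken from the sample above.

```java
import java.net.URI;

public class AbsUrlSketch {

    // Hypothetical helper: resolves a possibly relative reference against a base URI.
    // Throws IllegalArgumentException if either string is not a syntactically valid URI.
    public static String toAbsolute(String baseUri, String reference) {
        URI base = URI.create(baseUri);
        // A base such as "http://localhost" has an empty path. Give it "/" first,
        // so that a relative path is placed after the host instead of being glued to it.
        String path = base.getPath();
        if (path != null && path.isEmpty()) {
            base = base.resolve("/");
        }
        return base.resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost";
        // Root-relative, document-relative and already absolute references
        System.out.println(toAbsolute(base, "/css/main.css"));
        System.out.println(toAbsolute(base, "img/my-image.jpg"));
        System.out.println(toAbsolute(base, "http://example.org/a.png"));
    }
}
```

Root-relative paths, document-relative paths and already absolute URLs all go through the same call, so there is no need to list directory names such as /css/ or /jscript/ one by one.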