Reputation: 35
I'm working on a web scraper for a website, but my current code only scrapes relative URLs to images. How can I convert those URLs to absolute ones?
Second problem: when I build the link manually as http://www.arena-offshore.com/iframe/list/../../res2.php?res=site/big/08032016130016552-GEMI-gözcü1.jpg&g=500&u=335
and open it in a browser, I only see some sort of text file instead of the picture. Is it possible to get a direct link to the picture that displays normally in a browser?
Current code:
Document doc;
String url = "http://www.arena-offshore.com/iframe/list/list-detail.php?category=1&page=&id=956&id=956";
try {
    doc = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
            .get();
    Elements elements = doc.select("#u702_img");
    for (Element element : elements) {
        String src = element.attr("src");
        System.out.println(src);
    }
} catch (IOException e) {
    e.printStackTrace();
}
Output
../../res2.php?res=site/big/08032016130016552-GEMI-gözcü1.jpg&g=500&u=335
Upvotes: 2
Views: 354
Reputation: 4380
The text file is the image. You can tell that it is a JPEG because the file starts with:
ÿØÿàJFIFÿþ>CREATOR: gd-jpeg v1.0 (using IJG JPEG v62)
When you save the text file from your browser (Right click > Save as...) and give it the .jpg extension, it will be rendered correctly.
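The signature check described above can also be done in code. The following is a minimal sketch rather than part of the original answer: the class name JpegSniffer and the method looksLikeJpeg are hypothetical, and it only inspects the three leading bytes 0xFF 0xD8 0xFF, which show up as ÿØÿ when the data is displayed as Latin-1 text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether a byte array starts with the JPEG signature.
class JpegSniffer {

    // A JPEG stream begins with the SOI marker FF D8, followed by another marker (FF ..).
    static boolean looksLikeJpeg(byte[] data) {
        return data != null
                && data.length >= 3
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8
                && (data[2] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        // The bytes behind the "ÿØÿàJFIF" prefix quoted in the answer.
        byte[] header = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 'J', 'F', 'I', 'F'};
        System.out.println(looksLikeJpeg(header)); // prints "true"

        byte[] html = "<html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeJpeg(html));   // prints "false"
    }
}
```

A check like this could be run on the downloaded bytes before writing them to a .jpg file.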
You can take the image URL from your src output:
String baseUrl = "http://www.arena-offshore.com/";
String output = "../../res2.php?res=site/big/08032016130016552-GEMI-gözcü1.jpg&g=500&u=335";
int start = output.indexOf("=") + 1;
int end = output.indexOf("&", start);
String imageUrl = baseUrl + output.substring(start, end);
// Gives (non-ASCII characters are shown percent-encoded here):
// http://www.arena-offshore.com/site/big/08032016130016552-GEMI-g%C3%B6zc%C3%BC1.jpg
Then you could download the image using jsoup:
byte[] bytes = Jsoup.connect(imageUrl).ignoreContentType(true).execute().bodyAsBytes();
Note that jsoup also provides the element.absUrl("src") method to get the absolute URL of an image, although that may not work in your case since it points to a PHP page.
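For the first part of the question (turning a relative src into an absolute URL), here is a minimal sketch using java.net.URI from the standard library as a stand-in, so it runs without jsoup. The class name RelativeUrlResolver and the method toAbsolute are hypothetical. The result still points at res2.php, not at the image file itself.

```java
import java.net.URI;

// Hypothetical helper: resolves a relative reference against the page URL.
class RelativeUrlResolver {

    // Returns the absolute URL with non-ASCII characters percent-encoded as UTF-8.
    static String toAbsolute(String pageUrl, String relativeSrc) {
        URI base = URI.create(pageUrl);
        // resolve() merges the paths and collapses the "../" segments.
        return base.resolve(relativeSrc).toASCIIString();
    }

    public static void main(String[] args) {
        String pageUrl = "http://www.arena-offshore.com/iframe/list/list-detail.php?category=1&page=&id=956&id=956";
        // The two umlauts are written as Unicode escapes so the source compiles under any file encoding.
        String src = "../../res2.php?res=site/big/08032016130016552-GEMI-g\u00f6zc\u00fc1.jpg&g=500&u=335";
        System.out.println(toAbsolute(pageUrl, src));
    }
}
```

With the values from the question, the two "../" segments cancel out "iframe/list/", so the resolved path is /res2.php at the site root.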
Upvotes: 1
Reputation: 2781
From your current output, just remove the leading ../../res2.php?res= and the trailing parameters &g=500&u=335, then append what is left to the site root. You will get the direct link:
http://www.arena-offshore.com/site/big/08032016130016552-GEMI-g%C3%B6zc%C3%BC1.jpg
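The same trimming can be sketched in code by reading the res query parameter instead of cutting the string at fixed positions. This is only an illustration under assumptions: the class name DirectImageLink and the method fromSrc are hypothetical, and it assumes the value of res is always a path relative to the site root, as it is in the question's output.

```java
import java.net.URI;

// Hypothetical helper: builds the direct image URL from the scraped src attribute.
class DirectImageLink {

    // Finds the "res" query parameter and resolves its value against the site root.
    static String fromSrc(String siteRoot, String src) {
        String query = src.substring(src.indexOf('?') + 1);
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("res")) {
                String imagePath = pair.substring(eq + 1);
                // toASCIIString() percent-encodes the non-ASCII characters as UTF-8.
                return URI.create(siteRoot).resolve(imagePath).toASCIIString();
            }
        }
        throw new IllegalArgumentException("No res parameter in: " + src);
    }

    public static void main(String[] args) {
        // The two umlauts are written as Unicode escapes so the source compiles under any file encoding.
        String src = "../../res2.php?res=site/big/08032016130016552-GEMI-g\u00f6zc\u00fc1.jpg&g=500&u=335";
        System.out.println(fromSrc("http://www.arena-offshore.com/", src));
        // prints "http://www.arena-offshore.com/site/big/08032016130016552-GEMI-g%C3%B6zc%C3%BC1.jpg"
    }
}
```

Looking the parameter up by name keeps the sketch working if the order of the query parameters changes.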
Upvotes: 1