Reputation: 9478
I am using HtmlUnit to scrape images from web pages. I am a beginner with HtmlUnit. I have written some code, but I don't know how to get the images. My code is below.
import java.io.*;
import java.net.URL;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class urlscrap {
    public static void main(String[] args) throws Exception {
        //WebClient webClient = new WebClient(Opera);
        WebClient webClient = new WebClient();
        HtmlPage currentPage = (HtmlPage) webClient.getPage(new URL("http://www.google.com"));
        System.out.println(currentPage.asText());
        //webClient.closeAllWindows();
    }
}
Upvotes: 1
Views: 4944
Reputation: 9295
Does this work for you?
import java.net.URL;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlImage;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class urlscrap {
    public static void main(String[] args) throws Exception {
        //WebClient webClient = new WebClient(Opera);
        WebClient webClient = new WebClient();
        HtmlPage currentPage = (HtmlPage) webClient.getPage(new URL("http://www.google.com"));
        // Get a list of all img elements on the page
        final List<?> images = currentPage.getByXPath("//img");
        for (Object imageObject : images) {
            HtmlImage image = (HtmlImage) imageObject;
            // Print the value of the src attribute as written in the page
            System.out.println(image.getSrcAttribute());
        }
        //webClient.closeAllWindows();
    }
}
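Note that the src values printed above are the attribute values as written in the page, so they may be relative paths. If you want to download the images, you will probably need absolute URLs first. Here is a minimal sketch of that step using only the JDK; the class name ImageUrlResolver, the method resolveAll, and the example.com URLs are made up for illustration and are not part of HtmlUnit. It assumes the src values are syntactically valid URIs (no raw spaces, for example) and that the page URL includes a path.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of HtmlUnit.
class ImageUrlResolver {

    // Resolves each raw src value against the URL of the page it was found on.
    // Assumes every src is a syntactically valid URI; URI.create throws otherwise.
    static List<String> resolveAll(String pageUrl, List<String> srcValues) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String src : srcValues) {
            absolute.add(base.resolve(src.trim()).toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        // Example inputs: a root-relative path, a page-relative path, an absolute URL.
        List<String> srcs = List.of("/images/logo.png", "img/a.gif", "https://example.org/b.jpg");
        for (String url : resolveAll("http://www.example.com/dir/page.html", srcs)) {
            System.out.println(url);
        }
    }
}
```

In the loop above you could collect image.getSrcAttribute() into a list and pass it to resolveAll together with the page URL.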
Upvotes: 5
Reputation: 2553
If you don't mind switching languages, I would recommend Python's Scrapy. It is the best framework I have used so far for scraping web content, including images (it can even create thumbnails for you automatically). Personally, I would not use Java for such tasks.
Upvotes: 0
Reputation: 15063
It looks like you're getting the text of the page, which is indeed the first step. What is your question? Are you having trouble finding all the images referenced within the page? I recommend looking up how to do DOM parsing in Java and using it to extract all the img tags from the page.
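To give a rough idea of what DOM parsing for img tags could look like, here is a minimal sketch using the XML parser that ships with the JDK. The class name ImgTagExtractor and the sample markup are made up for illustration. It assumes the markup is well-formed XHTML; most real-world HTML is not, so in practice you would likely need an HTML-aware parser to build the DOM.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class ImgTagExtractor {

    // Parses well-formed XHTML and returns the src value of every img element that has one.
    // Assumption: the input is well-formed XML; the parser throws on ordinary tag-soup HTML.
    static List<String> extractImgSrcs(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList imgs = doc.getElementsByTagName("img");
        List<String> srcs = new ArrayList<>();
        for (int i = 0; i < imgs.getLength(); i++) {
            Element img = (Element) imgs.item(i);
            if (img.hasAttribute("src")) {
                srcs.add(img.getAttribute("src"));
            }
        }
        return srcs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<img src=\"a.png\"/><p>text</p>"
                + "<img src=\"/b.jpg\" alt=\"b\"/>"
                + "<img alt=\"no src\"/>"
                + "</body></html>";
        for (String src : extractImgSrcs(page)) {
            System.out.println(src);
        }
    }
}
```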
Upvotes: 0