Reputation: 499
The following is an example Amazon link that I am trying to crawl for the image's width and height:
http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg
I am using jsoup, and the following is my code:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Crawler_main {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String filepath = "C:/imagelinks.txt";
        try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
            String line;
            String width;
            //String height;
            while ((line = br.readLine()) != null) {
                // process the line.
                System.out.println(line);
                Document doc = Jsoup.connect(line).ignoreContentType(true).get();
                //System.out.println(doc.toString());
                Elements jpg = doc.getElementsByTag("img");
                width = jpg.attr("width");
                System.out.println(width);
                //String title = doc.title();
            }
        }
        catch (FileNotFoundException ex){
            System.out.println("File not found");
        }
        catch(IOException ex){
            System.out.println("Unable to read line");
        }
        catch (Exception ex){
            System.out.println("Exception occured");
        }
    }
}
The HTML is fetched, but when I extract the width attribute, it returns null. When I print the fetched HTML, it contains garbage characters (I am guessing it's the actual image data that I am calling garbage characters).
I can't even paste the document.toString() result in this editor. Help!
Upvotes: 0
Views: 744
Reputation: 123
The problem is that you're fetching the jpg file, not any HTML. The call to ignoreContentType(true) provides a clue, as its documentation states:
Ignore the document's Content-Type when parsing the response. By default this is false, an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.)
If you want to obtain the width of the actual jpg file, this SO answer may be of use:
BufferedImage bimg = ImageIO.read(new File(filename));
int width = bimg.getWidth();
int height = bimg.getHeight();
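Since your links point at remote images rather than local files, you could probably skip jsoup entirely and hand the URL's stream to ImageIO. Below is a minimal sketch of that idea, not a tested drop-in for your crawler: the class name ImageSize and the helper readDimensions are names I made up for illustration, and it assumes the server returns a format ImageIO can decode (JPEG, PNG, GIF, BMP are supported out of the box).

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import javax.imageio.ImageIO;

public class ImageSize {

    /**
     * Hypothetical helper: decodes an image from the stream and
     * returns its dimensions as {width, height}.
     */
    static int[] readDimensions(InputStream in) throws IOException {
        BufferedImage img = ImageIO.read(in);
        // ImageIO.read returns null when no registered reader can decode the data
        if (img == null) {
            throw new IOException("Stream does not contain a supported image format");
        }
        return new int[] { img.getWidth(), img.getHeight() };
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java ImageSize <image-url>");
            return;
        }
        // Open the remote image directly; no HTML parsing is involved
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            int[] size = readDimensions(in);
            System.out.println(size[0] + " x " + size[1]);
        }
    }
}
```

In your loop, you would call something like readDimensions on each line's URL instead of Jsoup.connect(line). Note that this decodes the whole image just to get its size, which is fine for a handful of links but may be slow for a large list.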
Upvotes: 1