Reputation:
I'm building a web scraper using jsoup. I'm trying to extract the title attribute of the img element from the HTML below.
<div id="insideScroll" class="grid slider desktop-view">
<ul class="ng-scope" ng-if="2 === selectedCategoryId">
<li class="" data-list-item="">
<span>
<a class="grid-col--subnav ng-isolate-scope" data-internal-referrer-link="hub nav" data-link-name="hub nav daughter" data-click-id="hub nav 2" href="/recipes/111/appetizers-and-snacks/beans-and-peas/?internalSource=hub nav&referringId=76&referringContentType=recipe hub&linkName=hub nav daughter&clickId=hub nav 2" target="_self">
<img class="" alt="Bean and Pea Appetizers" title="Bean and Pea Appetizers" src="http://images.media-allrecipes.com/userphotos/140x140/00/60/91/609167.jpg">
<span class="category-title">Bean and Pea Appetizers</span>
</a>
</span>
</li>
</ul>
</div>
Here is the function I have so far, but it isn't working. I get a NullPointerException when I run it, which, judging from the stack trace, I assume is caused by the img element's empty class attribute. I could also take the title from the span with class category-title, but I'm having trouble getting its text as well. Thank you for your help.
@Override
public ArrayList<String> parseDocForTitles(Document doc) {
ArrayList<String> titles = new ArrayList<>();
String title;
Element insideScroll = doc.getElementById("insideScroll");
Elements img = insideScroll.select("img.\"\"");
for(Element ttle : img){
title = ttle.attr("title");
out.println(title); //just for testing
titles.add(title);
}
return titles;
}
Below is the stack trace I'm receiving:
[-]ERROR: See Stack Trace
java.lang.NullPointerException
at Scraper.Appetizers.parseDocForTitles(Appetizers.java:35)
at Scraper.Driver.main(Driver.java:25)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Upvotes: 0
Views: 775
Reputation: 18923
This is what worked for me:
Document document;
try { //Get Document object after parsing the html from given url.
document = Jsoup.connect(yourURL).get();
//Get images from document object.
Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
//Iterate images and print image attributes.
for (Element image : images) {
System.out.println("Image Title: " + image.attr("title"));
}
} catch (IOException e) {
e.printStackTrace();
}
Upvotes: 2
Reputation: 2434
You just need to select the img Elements correctly.
Change this:
Elements img = insideScroll.select("img.\"\"");
To this:
Elements img = insideScroll.select("img");
And it should work.
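As a minimal sketch of that change, the example below parses a trimmed copy of the HTML from the question with Jsoup.parse and collects the title attributes using the plain "img" selector. The class name TitleExtractor and the inline HTML string are assumptions made for illustration. The null check is also an assumption: line 35 of Appetizers.java is not shown, but getElementById returns null when no element with that id exists in the parsed document, so a missing div is one possible source of the NullPointerException.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TitleExtractor {

    // Hypothetical helper: returns the title attribute of every img under #insideScroll.
    static List<String> parseDocForTitles(Document doc) {
        List<String> titles = new ArrayList<>();
        Element insideScroll = doc.getElementById("insideScroll");
        if (insideScroll == null) {
            // The id is absent from the parsed HTML, so there is nothing to select from.
            return titles;
        }
        // Select by tag name only; the empty class attribute needs no special handling.
        for (Element img : insideScroll.select("img")) {
            titles.add(img.attr("title"));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Trimmed copy of the markup from the question (assumed here for a self-contained run).
        String html = "<div id=\"insideScroll\" class=\"grid slider desktop-view\">"
                + "<ul><li><span><a href=\"#\">"
                + "<img class=\"\" alt=\"Bean and Pea Appetizers\""
                + " title=\"Bean and Pea Appetizers\" src=\"609167.jpg\">"
                + "<span class=\"category-title\">Bean and Pea Appetizers</span>"
                + "</a></span></li></ul></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(parseDocForTitles(doc));
    }
}
```

If the list comes back empty against the live page, it may be worth printing the fetched document to check whether the insideScroll div is present in the HTML that jsoup actually received.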
Upvotes: 0