Reputation: 1669
How could I parse this with jsoup?
<!-- NOVINEEE -->
<div class="right_naslov"><a href="/e-novine">e-novine</a></div>
<div class="right_post">
<span class="right_post_nadnaslov"><font class="nadnaslov">Zanimljiv zadatak</font></span><span class="right_post_datum"><font class="datum">12.12.2014.</font></span>
<span class="right_post_naslov_v"><font class="naslov"><a href="/e-novine/n/?id=340">Profesor učenicima zadao najbolji zadatak ikad!</a></font></span>
<span class="right_post_podnaslov"><font class="podnaslov"></font></span>
<div class="right_post_tekst"><a href="/e-novine/n/?id=340"><img width="180" align="left" class="novine_slika_thumbm" border="0" src="/fajlovi/slike/thumbm/4161-zadatak_naslovna.jpg" /></a><p>72-godišnji profesor bivšim učenicima iz godine u godinu šalje pisma što nije lak zadatak jer mnogi ne žive u istoj državi. Iako radi nešto stvarno posebno, Bruce sebe i dalje smatra prosječnim profesorom. Učenici ipak smatraju suprotno...</p>
<div> </div></div>
</div>
</div>
I'd like to get the text of right_naslov, the text inside the font elements with the classes nadnaslov and naslov, and the img src and a href inside right_post_tekst.
I tried doing something like this:
Document doc = Jsoup.connect(url).get();
Elements post = doc.select("right_naslov right_post nadnaslov");
HashMap<String, String> map = new HashMap<String, String>();
map.put("rank", post.text());
// Get the second td
map.put("country", post.text());
// Get the third td
map.put("population", post.text());
// Set all extracted Jsoup Elements into the array
arraylist.add(map);
And afterwards I do:
resultp = data.get(position);
// Locate the TextViews in listview_item.xml
rank = (TextView) itemView.findViewById(R.id.rank);
country = (TextView) itemView.findViewById(R.id.country);
population = (TextView) itemView.findViewById(R.id.population);
// Capture position and set results to the TextViews
rank.setText(resultp.get(PocetnaFragment.RANK));
country.setText(resultp.get(PocetnaFragment.COUNTRY));
population.setText(resultp.get(PocetnaFragment.POPULATION));
I've been following this tutorial: http://www.androidbegin.com/tutorial/android-jsoup-listview-images-texts-html-tables-tutorial/
There are multiple right_post elements on the page.
Thanks
UPDATE
After trying the answer below, I get the following error:
02-14 23:50:17.490 2469-2530/gimbi.edu.ba W/System.err﹕ java.lang.NullPointerException: Attempt to invoke virtual method 'org.jsoup.select.Elements org.jsoup.nodes.Element.select(java.lang.String)' on a null object reference
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:147)
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:103)
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$2.call(AsyncTask.java:288)
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.FutureTask.run(FutureTask.java:237)
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:231)
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1112)
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
02-14 23:50:17.494 2469-2530/gimbi.edu.ba W/System.err﹕ at java.lang.Thread.run(Thread.java:818)
02-14 23:50:46.713 2469-4859/gimbi.edu.ba W/System.err﹕ java.lang.NullPointerException: Attempt to invoke virtual method 'org.jsoup.select.Elements org.jsoup.nodes.Element.select(java.lang.String)' on a null object reference
02-14 23:50:46.718 2469-4859/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:147)
02-14 23:50:46.718 2469-4859/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:103)
02-14 23:50:46.719 2469-4859/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$2.call(AsyncTask.java:288)
02-14 23:50:46.719 2469-4859/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.FutureTask.run(FutureTask.java:237)
02-14 23:50:46.719 2469-4859/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:231)
02-14 23:50:46.719 2469-4859/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1112)
02-14 23:50:46.719 2469-4859/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
02-14 23:50:46.719 2469-4859/gimbi.edu.ba W/System.err﹕ at java.lang.Thread.run(Thread.java:818)
As I said in the comments, I tried removing the img element lookup, but the same error occurs for every element when I call the map.put methods.
Upvotes: 0
Views: 1364
Reputation: 1750
Have a read of this link on how jsoup is used to extract data.
The following is an example based on your scenario.
Document doc = null;
Element aEle = null;
Element fontEle = null;
try {
    doc = ......
    /** Get A tag that is under DIV with classname right_naslov **/
    aEle = doc.select("div.right_naslov > a").first();
    if (aEle != null) {
        System.out.println("right_naslov content: " + aEle.ownText());
    }
    /** Get Font tag with [classname=nadnaslov] under span[classname=right_post_nadnaslov] under div[classname=right_post] **/
    /** Font[classname=naslov] can be retrieved the same way **/
    fontEle = doc.select("div.right_post > span.right_post_nadnaslov > font.nadnaslov").first();
    if (fontEle != null) {
        System.out.println("font nadnaslov content: " + fontEle.ownText());
    }
    /** Get A tag that is under div[classname=right_post_tekst] under div[classname=right_post] **/
    aEle = doc.select("div.right_post > div.right_post_tekst > a").first();
    if (aEle != null) {
        System.out.println("a href: " + aEle.attr("href"));
        /** Get inner IMG tag with classname as 'novine_slika_thumbm' **/
        Element imgEle = aEle.select("img.novine_slika_thumbm").first();
        if (imgEle != null) {
            System.out.println("img src: " + imgEle.attr("src"));
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
The example above only works if there is a single DIV[classname=right_naslov] or DIV[classname=right_post] in the HTML document you are parsing, because I use Elements.first() when extracting data, which always selects the first Element that matches the selector. Try playing around with jsoup and have fun. Once you have all the data, store it in either a HashMap or an ArrayList, as you like.
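To make the behaviour of first() concrete, here is a minimal sketch rather than code from your app. It parses a trimmed copy of the markup from the question; the second right_post and the names FirstMatchSketch, firstText and countMatches are made up for illustration. It compares a selector without the leading dots, as in the question's attempt, with one that uses class selectors.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstMatchSketch {
    // Trimmed copy of the question's markup; the second right_post is invented for the demo.
    static final String HTML =
        "<div class=\"right_naslov\"><a href=\"/e-novine\">e-novine</a></div>"
      + "<div class=\"right_post\">"
      + "<span class=\"right_post_nadnaslov\"><font class=\"nadnaslov\">Zanimljiv zadatak</font></span>"
      + "</div>"
      + "<div class=\"right_post\">"
      + "<span class=\"right_post_nadnaslov\"><font class=\"nadnaslov\">Drugi post</font></span>"
      + "</div>";

    /** Text of the first element matching the selector, or null when nothing matches. */
    static String firstText(Document doc, String cssQuery) {
        Element el = doc.select(cssQuery).first(); // first() yields null for an empty result
        return el == null ? null : el.text();
    }

    /** Number of elements matching the selector. */
    static int countMatches(Document doc, String cssQuery) {
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // Without dots, each word is treated as a tag name, so nothing matches.
        System.out.println(countMatches(doc, "right_naslov right_post nadnaslov"));
        // With class selectors both posts match, but first() only exposes the first one.
        System.out.println(countMatches(doc, "div.right_post font.nadnaslov"));
        System.out.println(firstText(doc, "div.right_post font.nadnaslov"));
    }
}
```

Calling a method on the result of first() without a null check is one way to end up with a NullPointerException like the one in the question, so a small helper along these lines keeps the check in one place.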
Updated
What you could do is select all the DIV[classname=right_post] elements with Document.select(), which returns an Elements object, and then loop over each Element to get its inner data. In the following example, you end up with two HashMap items in the arraylist variable.
The page contains two div[classname=right_naslov] elements, and I only retrieve the second one, which comes after the <!-- NOVINEEE --> comment. It also contains five div[classname=right_post] elements; the code skips those that have no inner span[classname=right_post_nadnaslov] element.
List<HashMap<String, String>> arraylist = new ArrayList<HashMap<String, String>>();
Elements aEles = null;
Elements divRightPostEles = null;
String rightNaslov = null;
Document doc = null;
try {
    doc = Jsoup.connect(url).get();
    /** Get A tag that is under DIV with classname right_naslov **/
    aEles = doc.select("div.right_naslov > a");
    if (aEles != null && aEles.size() > 0) {
        if (aEles.size() == 2)
            rightNaslov = aEles.get(1).ownText();
        else
            rightNaslov = aEles.first().ownText();
    }
    /**
     * Since you say there are multiple DIVs with right_post as
     * classname, we get all those right_post elements and loop over
     * them one by one to retrieve their inner elements
     **/
    divRightPostEles = doc.select("div.right_post");
    for (Element rightPostDiv : divRightPostEles) {
        /** Each iteration of this loop handles one right_post DIV element **/
        HashMap<String, String> map = new HashMap<String, String>();
        /**
         * Get Font tag with [classname=nadnaslov] under
         * span[classname=right_post_nadnaslov] under
         * div[classname=right_post]
         **/
        /** Font[classname=naslov] can be retrieved the same way **/
        Elements fontNadnaslov = rightPostDiv
                .select("span.right_post_nadnaslov > font.nadnaslov");
        /**
         * Get A tag that is under div[classname=right_post_tekst] under
         * div[classname=right_post]
         **/
        Element aRightPostTekst = rightPostDiv.select(
                "div.right_post_tekst > a[href]").first();
        // Retrieve Jsoup Elements; posts without a nadnaslov are skipped
        if (fontNadnaslov != null && fontNadnaslov.size() > 0) {
            map.put("country", fontNadnaslov.first().ownText());
            if (aRightPostTekst != null) {
                map.put("population", aRightPostTekst.attr("href"));
                Element img = aRightPostTekst.select("img[src]").first();
                if (img != null)
                    map.put("image", img.attr("src"));
            }
            if (rightNaslov != null)
                map.put("rank", rightNaslov);
            // Set all extracted Jsoup Elements into the array
            arraylist.add(map);
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
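One thing to watch: the href and src values in your markup are site-relative (they start with "/"), so attr("href") and attr("src") return them as written. If you need full URLs, for example to download the thumbnail, jsoup's absUrl() resolves an attribute against the document's base URI. The sketch below is only an illustration: BASE_URI is a placeholder (http://example.com/), and the class and method names are hypothetical. When the page is fetched with Jsoup.connect(url).get(), the base URI is taken from the fetched URL, so it does not need to be passed in by hand.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteUrlSketch {
    // Placeholder base URI; substitute the address the page was actually fetched from.
    static final String BASE_URI = "http://example.com/";

    // Trimmed copy of one right_post_tekst block from the question.
    static final String HTML =
        "<div class=\"right_post\"><div class=\"right_post_tekst\">"
      + "<a href=\"/e-novine/n/?id=340\">"
      + "<img class=\"novine_slika_thumbm\" src=\"/fajlovi/slike/thumbm/4161-zadatak_naslovna.jpg\" />"
      + "</a></div></div>";

    /** Absolute URL of the given attribute on the first match, or "" if nothing matches. */
    static String absoluteAttr(Document doc, String cssQuery, String attribute) {
        Element el = doc.select(cssQuery).first();
        return el == null ? "" : el.absUrl(attribute);
    }

    public static void main(String[] args) {
        // Passing a base URI to parse() lets jsoup resolve relative links.
        Document doc = Jsoup.parse(HTML, BASE_URI);
        System.out.println(absoluteAttr(doc, "div.right_post_tekst > a", "href"));
        System.out.println(absoluteAttr(doc, "div.right_post_tekst img", "src"));
    }
}
```

In the loop above, that would mean swapping attr("href") and attr("src") for absUrl("href") and absUrl("src") if absolute links are what the adapter needs.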
Upvotes: 2