dynamitem

Reputation: 1669

Parsing HTML with Jsoup into an ArrayList

How can I parse this HTML with Jsoup?

    <!-- NOVINEEE -->
<div class="right_naslov"><a href="/e-novine">e-novine</a></div>

  <div class="right_post">
    <span class="right_post_nadnaslov"><font class="nadnaslov">Zanimljiv zadatak</font></span><span class="right_post_datum"><font class="datum">12.12.2014.</font></span>
    <span class="right_post_naslov_v"><font class="naslov"><a href="/e-novine/n/?id=340">Profesor učenicima zadao najbolji zadatak ikad!</a></font></span>
    <span class="right_post_podnaslov"><font class="podnaslov"></font></span>
    <div class="right_post_tekst"><a href="/e-novine/n/?id=340"><img width="180" align="left" class="novine_slika_thumbm" border="0" src="/fajlovi/slike/thumbm/4161-zadatak_naslovna.jpg" /></a><p>72-godi&scaron;nji profesor biv&scaron;im učenicima iz godine u godinu &scaron;alje pisma &scaron;to nije lak zadatak jer mnogi ne žive u istoj državi. Iako radi ne&scaron;to stvarno posebno, Bruce sebe i dalje smatra prosječnim profesorom. Učenici ipak smatraju suprotno...</p>
<div>&nbsp;</div></div>
    </div>
</div>

I'd like to get the link text inside right_naslov, the text of the font elements with the classes nadnaslov and naslov, and the img src and a href inside right_post_tekst.

I tried doing something like this:

Document doc = Jsoup.connect(url).get();
Elements post = doc.select("right_naslov right_post nadnaslov");
HashMap<String, String> map = new HashMap<String, String>();

map.put("rank", post.text());
// Get the second td
map.put("country", post.text());
// Get the third td
map.put("population", post.text());

// Set all extracted Jsoup Elements into the array
arraylist.add(map);

And afterwards I do:

resultp = data.get(position);

// Locate the TextViews in listview_item.xml
rank = (TextView) itemView.findViewById(R.id.rank);
country = (TextView) itemView.findViewById(R.id.country);
population = (TextView) itemView.findViewById(R.id.population);


// Capture position and set results to the TextViews
rank.setText(resultp.get(PocetnaFragment.RANK));
country.setText(resultp.get(PocetnaFragment.COUNTRY));
population.setText(resultp.get(PocetnaFragment.POPULATION));

I've been following this tutorial: http://www.androidbegin.com/tutorial/android-jsoup-listview-images-texts-html-tables-tutorial/

There are multiple right_post divs on the page.

Thanks


UPDATE

After trying the answer below (see the comments there), I get the following error:

02-14 23:50:17.490    2469-2530/gimbi.edu.ba W/System.err﹕ java.lang.NullPointerException: Attempt to invoke virtual method 'org.jsoup.select.Elements org.jsoup.nodes.Element.select(java.lang.String)' on a null object reference
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:147)
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:103)
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$2.call(AsyncTask.java:288)
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.FutureTask.run(FutureTask.java:237)
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:231)
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1112)
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
02-14 23:50:17.494    2469-2530/gimbi.edu.ba W/System.err﹕ at java.lang.Thread.run(Thread.java:818)
02-14 23:50:46.713    2469-4859/gimbi.edu.ba W/System.err﹕ java.lang.NullPointerException: Attempt to invoke virtual method 'org.jsoup.select.Elements org.jsoup.nodes.Element.select(java.lang.String)' on a null object reference
02-14 23:50:46.718    2469-4859/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:147)
02-14 23:50:46.718    2469-4859/gimbi.edu.ba W/System.err﹕ at gimbi.edu.ba.PocetnaFragment$JsoupListView.doInBackground(PocetnaFragment.java:103)
02-14 23:50:46.719    2469-4859/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$2.call(AsyncTask.java:288)
02-14 23:50:46.719    2469-4859/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.FutureTask.run(FutureTask.java:237)
02-14 23:50:46.719    2469-4859/gimbi.edu.ba W/System.err﹕ at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:231)
02-14 23:50:46.719    2469-4859/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1112)
02-14 23:50:46.719    2469-4859/gimbi.edu.ba W/System.err﹕ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
02-14 23:50:46.719    2469-4859/gimbi.edu.ba W/System.err﹕ at java.lang.Thread.run(Thread.java:818)

As I said in the comments below, I tried removing the img element, but the same error occurs for every element when I call the map.put methods.

Upvotes: 0

Views: 1364

Answers (1)

Wilts C

Reputation: 1750

Have a read of this link on how Jsoup is used to extract data.

Here is an example based on your scenario.

    Document doc = null;
    Element aEle = null;
    Element fontEle = null;

    try {
        doc = ......

        /** Get A tag that is under DIV with classname right_naslov **/
        aEle = doc.select("div.right_naslov > a").first();
        if (aEle != null) {
            System.out.println("right_naslov content: " + aEle.ownText());
        }

        /** Get Font tag with [classname=nadnaslov] under span[classname=right_post_nadnaslov] under div[classname=right_post] **/
        /** Use the same approach to get Font[classname=naslov] **/
        fontEle = doc.select("div.right_post > span.right_post_nadnaslov > font.nadnaslov").first();
        if (fontEle != null) {
            System.out.println("font nadnaslov content: " + fontEle.ownText());
        }

        /** Get A tag that is under div[classname=right_post_tekst] under div[classname=right_post] **/
        aEle = doc.select("div.right_post > div.right_post_tekst > a").first();
        if (aEle != null) {
            System.out.println("a href: " + aEle.attr("href"));

            /** Get inner IMG tag with classname as 'novine_slika_thumbm' **/
            Element imgEle = aEle.select("img.novine_slika_thumbm").first();
            if (imgEle != null) {
                System.out.println("img src: " + imgEle.attr("src"));
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

The example above only works if there is a single DIV[classname=right_naslov] and a single DIV[classname=right_post] in the HTML document you are parsing, because I use Elements.first(), which always returns the first Element matching the selector (or null if nothing matches). Play around with Jsoup and have fun. Once you have all the data, store it in a HashMap or an ArrayList as you like.
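
The storing step could look roughly like the sketch below. It assumes the extracted values arrive as plain strings that may be null (for example when first() found nothing), and it uses the same keys as the tutorial ("rank", "country", "population"). The class RowBuilder and the helper putIfPresent are hypothetical names made up for this illustration; they are not part of Jsoup or of your code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowBuilder {

    /** Hypothetical helper: stores the value only when something was extracted. */
    public static void putIfPresent(Map<String, String> row, String key, String value) {
        if (value != null && !value.isEmpty()) {
            row.put(key, value);
        }
    }

    /** Builds one row from values that may be null when a selector matched nothing. */
    public static Map<String, String> buildRow(String rank, String country, String population) {
        Map<String, String> row = new HashMap<String, String>();
        putIfPresent(row, "rank", rank);
        putIfPresent(row, "country", country);
        putIfPresent(row, "population", population);
        return row;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = new ArrayList<Map<String, String>>();
        // Sample values copied from the HTML in the question; null stands for "not found".
        rows.add(buildRow("e-novine", "Zanimljiv zadatak", "/e-novine/n/?id=340"));
        rows.add(buildRow("e-novine", null, null));
        System.out.println(rows);
    }
}
```

Skipping missing values keeps null out of the map, so a row simply lacks the key and the reader of the list can decide what to show instead.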


Updated

What you can do is select all the DIV[classname=right_post] elements with Document.select(), which returns an Elements object, then loop over each Element to get its inner data. In the following example, you will end up with two HashMap items in the arraylist variable.

There are 2 div[classname=right_naslov] elements, and I only retrieve the second one, which comes after the <!-- NOVINEEE --> comment. There are 5 div[classname=right_post] elements, and I skip those that have no inner span[classname=right_post_nadnaslov] element.

    List<HashMap<String, String>> arraylist = new ArrayList<HashMap<String, String>>();
    Elements aEles = null;
    Elements divRightPostEles = null;
    String rightNaslov = null;
    Document doc = null;

    try {
        doc = Jsoup.connect(url).get();

        /** Get A tag that is under DIV with classname right_naslov **/
        aEles = doc.select("div.right_naslov > a");
        if (aEles != null && aEles.size() > 0) {
            if (aEles.size() == 2)
                rightNaslov = aEles.get(1).ownText();
            else
                rightNaslov = aEles.first().ownText();
        }

        /**
         * Since you say there are multiple DIV with right_post as
         * classname, we will get all those right post elements and loop
         * them one by one to retrieve its inner elements
         **/
        divRightPostEles = doc.select("div.right_post");

        for (Element rightPostDiv : divRightPostEles) {
            /** Each loop of this represents a right_post DIV element **/

            HashMap<String, String> map = new HashMap<String, String>();

            /**
             * Get Font tag with [classname=nadnaslov] under
             * span[classname=right_post_nadnaslov] under
             * div[classname=right_post]
             **/
            /** Use the same approach to get Font[classname=naslov] **/
            Elements fontNadnaslov = rightPostDiv
                    .select("span.right_post_nadnaslov > font.nadnaslov");

            /**
             * Get A tag that is under div[classname=right_post_tekst] under
             * div[classname=right_post]
             **/
            Element aRightPostTekst = rightPostDiv.select(
                    "div.right_post_tekst > a[href]").first();

            // Retrieve Jsoup Elements
            if (fontNadnaslov != null && fontNadnaslov.size() > 0) {
                map.put("country", fontNadnaslov.first().ownText());

                if (aRightPostTekst != null) {
                    map.put("population", aRightPostTekst.attr("href"));

                    Element img = aRightPostTekst.select("img[src]").first();

                    if (img != null)
                        map.put("image", img.attr("src"));
                }

                if (rightNaslov != null)
                    map.put("rank", rightNaslov);
                // Set all extracted Jsoup Elements into the array
                arraylist.add(map);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
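
On the adapter side, a row may be missing some keys (for instance "image" when a post has no thumbnail), so get() can return null. Below is a minimal sketch of reading a row defensively. It assumes your PocetnaFragment.RANK, COUNTRY and POPULATION constants hold the strings "rank", "country" and "population" as in the tutorial; RowReader and valueOrDefault are made-up names for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class RowReader {

    // Assumed to mirror the constants in PocetnaFragment (values taken from the tutorial keys).
    public static final String RANK = "rank";
    public static final String COUNTRY = "country";
    public static final String POPULATION = "population";
    public static final String IMAGE = "image";

    /** Returns the stored value, or the fallback when the key is absent or mapped to null. */
    public static String valueOrDefault(Map<String, String> row, String key, String fallback) {
        String value = row.get(key);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<String, String>();
        row.put(RANK, "e-novine");
        row.put(COUNTRY, "Zanimljiv zadatak");
        row.put(POPULATION, "/e-novine/n/?id=340");
        // No "image" entry on purpose: this post is treated as having no thumbnail.

        System.out.println(valueOrDefault(row, COUNTRY, ""));
        System.out.println(valueOrDefault(row, IMAGE, "(no image)"));
    }
}
```

In the adapter you would pass the result to setText() instead of printing it.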

Upvotes: 2
