DBANerd

Reputation: 35

Extract Data from HTML using JSoup

I am writing a script to extract data from an HTML document. Here is part of the document.

<div class="info">
<div id="info_box" class="inf_clear">
    <div id="restaurant_info_box_left">
        <table id="rest_logo">
            <tr>
                <td>
                    <a itemprop="url" title="XYZ" href="XYZ.com">
                        <img src="/files/logo/26721.jpg" alt="XYZ" title="XYZ" width="100" />
                    </a>
                </td>
            </tr>
        </table>
        <h1 id="Name"><a class="fn org url" rel="Order Online" href="XYZ.com" title="XYZ" itemprop="name">XYZ</a></h1>

        <div class="rest_data" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">

            <span itemprop="telephone">(305) 535-1379</span> | <b>
            <span itemprop="streetAddress">1755 Alton Rd</span>,
            <span itemprop="addressLocality">Miami Beach</span>,
            <span itemprop="addressRegion">FL</span>
            <span itemprop="postalCode">33139</span></b>
        </div>
        <div class="geo">
            <span class="latitude" title="25.792588"></span>
            <span class="longitude" title="-80.141214"></span>
        </div>
        <div class="rest_data">Estimated delivery time: <b>45-60 min</b></div>
    </div>

</div>

I am using Jsoup and am not quite sure how to achieve this.

There are many div tags in the document, and I try to match each one by a unique attribute, for example the div tag whose class attribute is "info":

    Elements divs = doc.select("div");

    for (Element div : divs) {
        String divClass = div.attr("class"); // attr() already returns a String
        if (divClass.equalsIgnoreCase("info")) {
            // matched: the table with id "rest_logo" inside this div is needed here
        }
    }

If matched, I have to get the table with id "rest_logo" inside that div tag.

When doc.select("table") is used, it looks like the parser searches the entire document.

What I need to achieve is, if the div tag attribute is matched, I need to fetch the elements and attributes inside the matched div tag.

Expected Output: 

Name : XYZ

telephone:(305) 535-1379

streetAddress:1755 Alton Rd

addressLocality:Miami Beach

addressRegion:FL

postalCode:33139

latitude:25.792588

longitude:-80.141214

Estimated delivery time:45-60 min

Any ideas?

Upvotes: 1

Views: 1543

Answers (3)

Roamer-1888

Reputation: 19288

Probably the main thing to realise is that an element with an id can be selected directly - no need to loop through a collection of elements searching for it.

I've not used JSoup and my Java is very rusty but here goes ...

// 1. Select elements from the document
// select() returns an Elements collection, so take the first match with first()
Element container = doc.select("#restaurant_info_box_left").first(); // element with id="restaurant_info_box_left"
Element h1 = container.select("h1").first(); // first h1 element in container
Elements restData = container.select(".rest_data"); // all elements in container with class="rest_data"
Element restData_0 = restData.get(0); // first rest_data div (phone and address)
Element restData_1 = restData.get(1); // second rest_data div (delivery time)
Elements restData_0_spans = restData_0.select("span"); // first rest_data div's spans
Elements geos = container.select(".geo"); // all elements in container with class="geo"
Element geo = geos.get(0); // first .geo div
Elements geo_spans = geo.select("span"); // first .geo div's spans

// 2. Compose output

// h1 text
System.out.println("Name: " + h1.text());

// restData_0_spans text, labelled by each span's itemprop attribute
for (Element span : restData_0_spans) {
    System.out.println(span.attr("itemprop") + ": " + span.text());
}

// geo data: the label is the class name, the value is in the title attribute
for (Element span : geo_spans) {
    System.out.println(span.attr("class") + ": " + span.attr("title"));
}

// restData_1 text
System.out.println(restData_1.text());

For someone used to JavaScript/jQuery, this all seems very laboured. With luck it may simplify somewhat.
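On the question's worry that doc.select("table") searches the entire document: calling select() on an Element only looks at that element's descendants. Below is a minimal sketch of that scoping, assuming Jsoup is on the classpath. The class name ScopedSelectDemo, the method firstTableIdInsideInfo and the trimmed HTML string (with a decoy table outside div.info) are made up for illustration and are not part of the original page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScopedSelectDemo {
    // Hypothetical, trimmed-down markup: one decoy table outside div.info, one table inside it.
    static final String HTML =
        "<table id=\"other\"><tr><td>decoy</td></tr></table>"
        + "<div class=\"info\">"
        + "<table id=\"rest_logo\"><tr><td>"
        + "<a itemprop=\"url\" title=\"XYZ\" href=\"XYZ.com\"></a>"
        + "</td></tr></table>"
        + "</div>";

    // Returns the id of the first table inside the first div.info, or null if either is missing.
    static String firstTableIdInsideInfo(String html) {
        Document doc = Jsoup.parse(html);
        Element info = doc.select("div.info").first();
        if (info == null) {
            return null;
        }
        // select() on an Element is scoped to that element, so the decoy table is never considered.
        Element table = info.select("table").first();
        return table == null ? null : table.id();
    }

    public static void main(String[] args) {
        System.out.println(firstTableIdInsideInfo(HTML));
    }
}
```

The same scoping can also be written as a single descendant selector, such as doc.select("div.info table#rest_logo"), if the intermediate element is not needed.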

Upvotes: 0

Jonas Czech

Reputation: 12328

Here's how I would do it:

Document doc = Jsoup.parse(myHtml);

Elements elements = doc.select("div.info")
    .select("a[itemprop=url], span[itemprop=telephone], span[itemprop=streetAddress], span[itemprop=addressLocality], span[itemprop=addressRegion], span[itemprop=postalCode], span.longitude, span.latitude");
// div.rest_data is a descendant of div.info, not a direct child, so use the descendant combinator
elements.add(doc.select("div.info div.rest_data").last());

for (Element e : elements) {
    if (e.attr("itemprop").equals("url")) {
        // the logo link has no text; the restaurant name is in its title attribute
        System.out.println("name: " + e.attr("title"));
    } else if (e.hasAttr("itemprop")) {
        System.out.println(e.attr("itemprop") + ": " + e.text());
    }

    if (e.attr("class").equals("longitude") || e.attr("class").equals("latitude")) {
        System.out.println(e.attr("class") + ": " + e.attr("title"));
    }

    if (e.attr("class").equals("rest_data")) {
        System.out.println(e.text());
    }
}

(Note: I wrote this on my phone, so it is untested and may contain typos, but it should work.)

A bit of explanation: First get all the desired elements via doc.select(...), and then extract the desired data from each one.
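The select-then-extract idea can also be written without listing every itemprop by hand. The following is only a sketch under stated assumptions: Jsoup is on the classpath, the markup is a shortened copy of the question's HTML, and the class name RestaurantFields and the method extract are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RestaurantFields {
    // Shortened copy of the question's markup, used only for this demo.
    static final String HTML =
        "<div class=\"info\"><div id=\"restaurant_info_box_left\">"
        + "<h1 id=\"Name\"><a class=\"fn org url\" href=\"XYZ.com\" itemprop=\"name\">XYZ</a></h1>"
        + "<div class=\"rest_data\" itemprop=\"address\">"
        + "<span itemprop=\"telephone\">(305) 535-1379</span> | "
        + "<span itemprop=\"postalCode\">33139</span></div>"
        + "<div class=\"geo\"><span class=\"latitude\" title=\"25.792588\"></span>"
        + "<span class=\"longitude\" title=\"-80.141214\"></span></div>"
        + "</div></div>";

    // Collects label -> value pairs from the first div.info, in insertion order.
    static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        Element info = doc.select("div.info").first();
        if (info == null) {
            return fields; // nothing to extract
        }

        fields.put("Name", info.select("h1").text());
        // Every span with an itemprop attribute: the attribute value is the label, the text is the value.
        for (Element span : info.select("span[itemprop]")) {
            fields.put(span.attr("itemprop"), span.text());
        }
        // Geo spans keep their value in the title attribute and are labelled by class name.
        for (Element span : info.select("div.geo span[title]")) {
            fields.put(span.className(), span.attr("title"));
        }
        return fields;
    }

    public static void main(String[] args) {
        extract(HTML).forEach((label, value) -> System.out.println(label + ": " + value));
    }
}
```

Using the attribute-presence selector span[itemprop] means a newly added itemprop span is picked up without changing the code, at the cost of also picking up any itemprop span that is not wanted.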

Let me know if it works.

Upvotes: 0

user4910279

Reputation:

    for (Element e : doc.select("div.info")) {
        System.out.println("Name: " + e.select("a.fn").text());
        System.out.println("telephone: " + e.select("span[itemprop=telephone]").text());
        System.out.println("streetAddress: " + e.select("span[itemprop=streetAddress]").text());
        // .....
    }

Upvotes: 1
