Runescapenoob

Reputation: 37

How to extract data from an HTML table and its nested table?

I have an HTML table structure with some data in the main table and some in a nested table inside a td element.

I only want the five required values (marked with ** xx **) so that I can export them to Excel as a single row.

<table cellpadding="2" cellspacing="0" width="100%" class="chart">
              <tr>
              <td>**Text 1**</td>         
                <td>
                  <table cellpadding="2" cellspacing="0">
                    <tr>
                      <td>some useless data</td>
                      <td>**Text 2**</td>
                    </tr>
                  </table>
                </td>
                <td>**Text 3**</td>
                <td>**Text 4**</td>
                <td>**Text 5**</td>
              </tr>
</table>

My code looks like this:

    for (Element row : excel.select("tr")) {
        // create row for each tag
        header = sheet.createRow(rowCount);
        // loop through all th tag
        Elements ths = row.select("th");
        int count = 0;
        for (Element element : ths) {
            // set header style
            cell = header.createCell(count);
            cell.setCellValue(element.text());
            cell.setCellStyle(headerStyle);
            count++;
        }
        // now loop through all td tag
        Elements tds = row.select("td");
        count = 0;
        for (Element element : tds) {
            if(!element.text().isEmpty()){
                cell = header.createCell(count);
                cell.setCellValue(element.text());
                count++;
            }
        }
    }

The problem is that the output is not what I expected.

It looks like this in Excel:

  Row1:  Text 1 | Text 2 | useless data | Text 2 | Text 3 | Text 4 | Text 5 |
  Row2:  useless data | Text 2 |

Additional information: tags are omitted to simplify the question.

What I want is

 Row1:  Text 1 | Text 2 | Text 3 | Text 4 | Text 5 |

Upvotes: 1

Views: 1922

Answers (1)

luksch

Reputation: 11712

1. Two rows

I assume excel is either the Document or the table element. Either way, excel.select("tr") also matches the tr of the inner table, which is why you get a second row. To prevent this, make the CSS selector more specific. Assuming excel is the Document, you can write:

Elements outerTrs = excel.select("table.chart>tbody>tr");

in the context of your code:

for (Element row : excel.select("table.chart>tbody>tr")) {

Explanation: Jsoup creates a tbody element inside a table if one is not present. The child combinator (>) in the selector ensures that only tr elements that are direct children of the outer table's tbody are selected. This works because the outer table has a known class name (chart) that appears to be unique.
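A minimal sketch of the difference between the two selectors, assuming jsoup is on the classpath. The class name SelectorDemo, its method names, and the inlined SAMPLE markup (a condensed copy of the table from the question) are hypothetical and only serve as illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorDemo {
    // Condensed copy of the table from the question (illustrative only).
    static final String SAMPLE =
        "<table class=\"chart\"><tr>"
        + "<td>Text 1</td>"
        + "<td><table><tr><td>some useless data</td><td>Text 2</td></tr></table></td>"
        + "<td>Text 3</td><td>Text 4</td><td>Text 5</td>"
        + "</tr></table>";

    // Counts every tr in the document, including the one in the nested table.
    static int countAllRows(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("tr").size();
    }

    // Counts only tr elements that are direct children of the outer table's tbody.
    static int countOuterRows(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("table.chart>tbody>tr").size();
    }

    public static void main(String[] args) {
        System.out.println("all tr:   " + countAllRows(SAMPLE));
        System.out.println("outer tr: " + countOuterRows(SAMPLE));
    }
}
```

With this sample, the plain "tr" selector matches two rows while the more specific selector matches only the outer one, which corresponds to the extra row seen in Excel.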

2. Unexpected number of columns

This happens because row.select("td") matches every td descendant of the row, including the td that contains the inner table and the td elements inside that table. If you only want td elements with no child elements, you could use this:

Elements tds = row.select("td");
count = 0;
for (Element element : tds) {
    // keep only non-empty cells that have no child elements
    if (!element.text().isEmpty() && element.children().isEmpty()) {
        count++;
        System.out.println("line " + count + " text = '" + element.text() + "'");
    }
}

3. useless data

To get rid of this, you have to filter it out explicitly. Your example does not make clear when the useless data is present. Is it always the first td in the inner table? If so, you can do this (full solution):

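import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// tab is assumed to be a String holding the HTML shown in the question
Document excel = Jsoup.parse(tab);

// iterate over the rows of the outer table only
for (Element row : excel.select("table.chart>tbody>tr")) {
    Elements tds = row.select("td");
    int count = 0;

    // first td of the inner table, assumed to always hold the useless data
    Element junkTd = row.select("td table td").first();

    for (Element element : tds) {
        // skip empty cells, the wrapper td around the inner table, and the junk cell
        if (!element.text().isEmpty()
                && element.children().isEmpty()
                && !element.equals(junkTd)) {

            count++;
            System.out.println("line " + count + " text = '" + element.text() + "'");
        }
    }
}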
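Since the goal is a single Excel row rather than console output, the same filter can first collect the cell texts into a list. This is only a sketch under assumptions: the class RowExtractor, the method extractRows, and the SAMPLE constant are hypothetical names, the junk cell is still assumed to be the first td of the inner table, and identity comparison (!=) is used here in place of equals.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class RowExtractor {
    // Condensed copy of the table from the question (illustrative only).
    static final String SAMPLE =
        "<table class=\"chart\"><tr>"
        + "<td>Text 1</td>"
        + "<td><table><tr><td>some useless data</td><td>Text 2</td></tr></table></td>"
        + "<td>Text 3</td><td>Text 4</td><td>Text 5</td>"
        + "</tr></table>";

    // Returns one list of cell texts per row of the outer table.
    static List<List<String>> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        for (Element row : doc.select("table.chart>tbody>tr")) {
            // Assumption: the junk cell is the first td of the inner table (may be null).
            Element junkTd = row.select("td table td").first();
            List<String> cells = new ArrayList<>();
            for (Element td : row.select("td")) {
                boolean isLeaf = td.children().isEmpty(); // excludes the td wrapping the inner table
                if (isLeaf && td != junkTd && !td.text().isEmpty()) {
                    cells.add(td.text());
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(extractRows(SAMPLE));
    }
}
```

Each inner list can then be written to the sheet with the same createCell loop used in the question, one list per Excel row. For the table in the question this should yield a single row holding Text 1 through Text 5.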

Upvotes: 1
