cdarwin

Reputation: 4291

Algorithm for building a representation of an HTML table

I need to parse an HTML table containing colspans and rowspans and build a representation of it.

Reading the HTML is not a problem: I'm using HTMLCleaner and XQuery with Saxon (Java).

But I'm looking for a good algorithm to build the table, as I don't understand the rules that are followed by the browsers for "difficult" cases.

For example, given the following table (where the rowspan is wrong)

<table border=1>
    <tr><td rowspan="3">1</td><td>2</td></tr>
    <tr><td>3</td></tr>
</table>

I apply the following algorithm:

1) for each tr 
    1.1) expand the colspan and rowspan of the cells in the current line
    1.2) create a new line if it doesn't already exist
    1.3) for each td add the elements to the line

i.e. (E is an empty cell)

newline->no line existing==no expansion
add line elements (1.3)
line1: 1 [tr=3], 2

newline->tr expansion (1.1)
line1: 1[tr=3], 2
line2: E
line3: E

add line elements (1.3)
line1: 1[tr=3], 2
line2: E, 3
line3: E

line3 has to be removed (Firefox renders only two lines). How can I know that?

I'm particularly interested in cases where the elements of an incomplete line are completed with those of the following one, like:

<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>5</td></tr>
<tr><td>6</td></tr>

rendering: 1 2 3 
           4 5 6

I have a practical case: this file contains two TRs which are rendered as one row even though they are two different TR elements. Why?

The lines are these (starting from line 129792):

[image: the source lines in question]

they are rendered as (inside the red rectangle)

[image: browser rendering, with the relevant rows inside a red rectangle]

How can I decide to enqueue elements to a previous line?

What rules do browsers follow for weird code?

I'm using Java, and I also understand JavaScript and a little PHP, but I'm mainly interested in the algorithm to follow. I'd like to know if something already exists, or to hear any suggestions.

What I want is to be able to output a text representation of the table like one rendered by a real browser.

Edit:

After reading xtratic's answer, I read the HTML table processing model specification, but it doesn't seem to answer my question about when one must enqueue elements to the previous line, as in the practical case I described (and added in this edit). Indeed, the document says "16. If current cell is the last td or th element child in the tr element being processed, then increase ycurrent by 1, abort this set of steps, and return to the algorithm above." But moving to a new line when the last td is found does not always seem to happen.

What interests me most is when to combine different rows. I tried to enqueue TDs after the ones of the previous line when the number of TDs of the previous line is fewer than the maximum already found, but it doesn't work.

Upvotes: 1

Views: 1193

Answers (1)

xtratic

Reputation: 4699

Read the HTML table processing model specification to find out all you need to know about how to process HTML tables. (it's not easy)

Since you want to parse the form of an html table, I recommend writing your processor following the steps exactly as listed under §4.9.12.1 Forming a table (step 18 gets into processing rows). I'm quite sure this is how browsers do it as well. The steps are written in such a way to be as convenient as possible for translating into code for a processor so you should be able to follow it pretty literally. Once your processor is done you should have a table of cells (as it is defined) and then you do whatever you want with the table model you now have. I can't promise it will be easy but at least you'll have a step by step guide.
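To give an idea of the shape of such a processor, here is a minimal Java sketch of the slot-assignment part only. It is not the full algorithm from the spec: it ignores row groups, column groups and rowspan="0", and it clips a rowspan at the last <tr>, which is my assumption based on the two-row Firefox rendering described in the question. The names `TableGridSketch`, `Cell`, `build` and `toText` are made up for illustration, and the input is assumed to be already extracted from the HTML.

```java
import java.util.Arrays;
import java.util.List;

class TableGridSketch {
    /** Hypothetical holder for one parsed td/th: its text and span attributes. */
    static final class Cell {
        final String text;
        final int colspan;
        final int rowspan;

        Cell(String text, int colspan, int rowspan) {
            this.text = text;
            this.colspan = Math.max(1, colspan); // treat missing/invalid spans as 1
            this.rowspan = Math.max(1, rowspan);
        }
    }

    /** Returns slots[y][x]: the cell covering that slot, or null if none does. */
    static Cell[][] build(List<List<Cell>> rows) {
        int height = rows.size(); // assumption: one grid row per tr, spans clipped
        int maxWidth = 0;         // safe upper bound: sum of all colspans
        for (List<Cell> row : rows) {
            for (Cell c : row) {
                maxWidth += c.colspan;
            }
        }
        Cell[][] slots = new Cell[height][maxWidth];
        int width = 0;
        for (int y = 0; y < height; y++) {
            int x = 0;
            for (Cell c : rows.get(y)) {
                // Skip slots already taken by a rowspan coming from an earlier row.
                while (slots[y][x] != null) {
                    x++;
                }
                for (int dy = 0; dy < c.rowspan && y + dy < height; dy++) {
                    for (int dx = 0; dx < c.colspan; dx++) {
                        slots[y + dy][x + dx] = c; // on overlap, the later cell wins
                    }
                }
                x += c.colspan;
                width = Math.max(width, x);
            }
        }
        // Trim every row to the width that was actually used.
        Cell[][] grid = new Cell[height][];
        for (int y = 0; y < height; y++) {
            grid[y] = Arrays.copyOf(slots[y], width);
        }
        return grid;
    }

    /** One line per row; "E" marks a slot that no cell covers. */
    static String toText(Cell[][] grid) {
        StringBuilder sb = new StringBuilder();
        for (Cell[] row : grid) {
            for (int x = 0; x < row.length; x++) {
                if (x > 0) {
                    sb.append(" | ");
                }
                sb.append(row[x] == null ? "E" : row[x].text);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

For the two-row table in the question (cell 1 with rowspan="3", then 2, then 3 on the next row), `toText(build(rows))` gives `1 | 2` and `1 | 3`: cell 3 lands in the second column because the first slot of its row is already taken, and no third row is created. Note that nothing is ever appended to a previous row; a cell only moves right within its own row.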


To be extra clear: there is no "combining rows" but there are cells that span multiple rows.

The algorithm for growing downward is what puts GENERALI SPA.. at the start of all those rows, and the data from the following <tr> elements is added into the next available cells on their respective rows.

GENERALI SPA... spans 4 rows, but its first row is hidden since there's no other data on it, so it looks like it only covers 3.

<tr> <!-- row 1 (0px high) -->
    <!-- td spans from [1,1] to [1,4] -->
    <!-- this fills the first column of rows 1, 2, 3, and 4 -->
    <td rowspan="4">GENERALI SPA #1</td>
</tr>
<tr> <!-- row 2 -->
    <!-- col 1 is taken by the cell defined above -->
    <!-- td spans from [2,2] to [2,3] taking up col 2 of row 2 and 3 -->
    <td rowspan="2">GENERALI SPA #2</td>
    <td>Proprieta'</td> <!-- ... -->
</tr>
<tr> <!-- row 3 -->
    <!-- col 1 and 2 are taken by the cells defined above -->
    <td rowspan="1">Totale #1</td> <!-- ... -->
</tr>
<tr> <!-- row 4 -->
    <!-- col 1 is taken by the cell defined above -->
    <td colspan="2">Totale #2</td> <!-- ... -->
</tr>

The table without formatting or hiding would look like this:

   1                      2                     3             4
  +----------------------+---------------------+-------------+---
1 |         ...          |      (implied)         (implied)       <-- 0px high (hidden)
  +-                    -+---------------------+-------------+---
2 | "GENERALI SPA #1"    | "GENERALI SPA #2"   | "Proprieta" | ..
  +-                    -+-                   -+-------------+---
3 |         ...          |         ...         | "Totale #1" | ..
  +-                    -+---------------------+-------------+---
4 |         ...          | "Totale #2"               ...     | ..
  +----------------------+---------------------+-------------+---

This would essentially be the table model you get after parsing by following the process in the html spec.

I don't see much point in removing "incomplete" rows (you would first have to define "incomplete"). Let them stay in the table: they are essentially header rows coming before the data they encompass, they aren't really hurting anything, and you can detect them easily enough.

However, if you really want to, you could remove rows that have no explicitly created cells other than cells that span into other rows. In the case of the table section above, you could remove row 1 because the cell in column 1 spans rows 1, 2, 3, and 4, and row 1 has no other explicitly created cells. Thus all the data of row 1 still exists in the other slots that cell spans ([1,2], [1,3], [1,4]) and you can safely remove row 1.
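As a rough illustration of that check, here is a small Java sketch. It assumes you already have the table model as a grid of cell ids (0 meaning no cell in that slot); the class name `SpanOnlyRows`, the method `removableRows` and the id encoding are all invented for this example. A row is flagged when every cell covering it also covers the row above or below.

```java
import java.util.ArrayList;
import java.util.List;

class SpanOnlyRows {
    /**
     * ids[y][x] is the id of the cell covering slot (x, y), or 0 if no cell does.
     * Returns the indexes (0-based) of rows whose cells all continue into a neighbouring row.
     */
    static List<Integer> removableRows(int[][] ids) {
        List<Integer> result = new ArrayList<>();
        for (int y = 0; y < ids.length; y++) {
            boolean removable = true;
            for (int x = 0; x < ids[y].length && removable; x++) {
                int id = ids[y][x];
                if (id == 0) {
                    continue; // empty slot, nothing would be lost
                }
                // Cells are rectangular, so looking straight up and down is enough.
                boolean above = y > 0 && ids[y - 1][x] == id;
                boolean below = y + 1 < ids.length && ids[y + 1][x] == id;
                if (!above && !below) {
                    removable = false; // this cell exists only in this row
                }
            }
            if (removable) {
                result.add(y);
            }
        }
        return result;
    }
}
```

With the section above encoded as `{1,0,0}, {1,2,3}, {1,2,4}, {1,5,5}` (1 = GENERALI SPA #1, 2 = GENERALI SPA #2, 3 = Proprieta', 4 = Totale #1, 5 = Totale #2), only row index 0 is flagged. One caveat: if two adjacent rows are covered by the same spanning cells and nothing else, both get flagged, so remove one row at a time and re-run the check.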

As an extra example, when I change rowspan to 1, this data appears on its own row, and the data from the following <tr> elements fills the available cells on their respective rows:

[image: rendering of the table with rowspan changed to 1]


vvv less relevant info vvv

The older HTML 4.01 Specification has a straightforward example relating to your question:

The next example illustrates (with the help of table borders) how cell definitions that span more than one row or column affect the definition of later cells. Consider the following table definition:

<TABLE border="1">
<TR><TD>1 <TD rowspan="2">2 <TD>3
<TR><TD>4 <TD>6
<TR><TD>7 <TD>8 <TD>9
</TABLE>

As cell "2" spans the first and second rows, the definition of the second row will take it into account. Thus, the second TD in row two actually defines the row's third cell. Visually, the table might be rendered to a tty device as:

-------------
| 1 | 2 | 3 | 
----|   |----
| 4 |   | 6 |
----|---|----
| 7 | 8 | 9 |
-------------

Note that if the TD defining cell "6" had been omitted, an extra empty cell would have been added by the user agent to complete the row.
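The "extra empty cell" part of that note is easy to mimic once spans have been expanded into slots. A minimal Java sketch, assuming each row is already a list of slot texts (the names `RowPadding` and `padRows` are hypothetical, and using an empty string for the filler is my choice):

```java
import java.util.ArrayList;
import java.util.List;

class RowPadding {
    /** Pads every row with empty strings so that all rows have the same width. */
    static List<List<String>> padRows(List<List<String>> rows) {
        int width = 0;
        for (List<String> row : rows) {
            width = Math.max(width, row.size());
        }
        List<List<String>> padded = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> copy = new ArrayList<>(row); // leave the input untouched
            while (copy.size() < width) {
                copy.add(""); // filler for a slot no cell was defined for
            }
            padded.add(copy);
        }
        return padded;
    }
}
```

If the TD for "6" were omitted, the second row would come out of span expansion as `["4", "2"]` and be padded to `["4", "2", ""]`. The padding goes at the end of the same row; nothing is pulled up from the next TR.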

This related question and answer lists some libraries that can help you with scraping the tables, but I don't believe that answer would handle the "difficult" cases, since it assumes that the position of a <td> element within its row corresponds exactly to its cell index in the table.

Upvotes: 3
