Sourav

Reputation: 17530

Extract data from HTML table row column

How can I extract data from an HTML table in PHP? The data is in the following format:

Table 1

<tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>

Table 2

<tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>

Table 3

<tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>

I want to get Data and Data_Text (or Data_Text_1 and Data_Text_2) from each of the three tables.
I've tried:

$html = file_get_contents($link);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes  = $xpath->query('//td[]');
$nodes2 = $xpath->query('//td[]');

But it doesn't return any data.

I'll offer a bounty on this question the day after tomorrow.

Upvotes: 1

Views: 28406

Answers (3)

pdizz

Reputation: 4240

Using the Simple HTML DOM parser (simple_html_dom.php):

<?php

include 'simple_html_dom.php';

$html = file_get_html('thetable.html');

$rows = $html->find('tr');
foreach($rows as $row) {
    echo $row->plaintext;
}

?>

Or select the 'td' elements directly:

<?php

include 'simple_html_dom.php';

$html = file_get_html('thetable.html');

$cells = $html->find('td');
foreach($cells as $cell) {
    echo $cell->plaintext;
}

?>

Upvotes: 1

Dimitre Novatchev

Reputation: 243599

Use this single XPath expression:

/*/table/tr//text()[normalize-space()]

This selects any text node that does not consist only of white-space characters and that is a descendant of any tr element that is a child of a table element that is a child of the top element of the document.
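In PHP, the same idea can be sketched with DOMXPath. This is only a minimal sketch, assuming the three sample rows from the question sit inside a single table (the wrapper table and the variable names are illustrative, not part of the original page). Because `loadHTML()` wraps the markup in `html` and `body` elements, the sketch uses the relative form `//tr//text()[normalize-space()]` rather than the absolute `/*/table/tr` path.

```php
<?php
// Sample markup: the three rows from the question, wrapped in one table (an assumption).
$html = <<<HTML
<table>
<tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>
<tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>
<tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>
</table>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of silencing them with "@".
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Every non-blank text node that is a descendant of any tr element.
$texts = [];
foreach ($xpath->query('//tr//text()[normalize-space()]') as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

Each entry of `$texts` is one text node in document order, so the values of a row appear next to each other but are not grouped by row.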

XSLT-based verification:

 <xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/table/tr//text()[normalize-space()]"/>

. . . . . . .
  <xsl:for-each select=
    "/*/table/tr//text()[normalize-space()]">
    "<xsl:copy-of select="."/>"
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied against the following XML document:

<html>
 <table>
    <tr>
        <td class="body" valign="top">
            <a href="example">
                <b>DATA</b>
            </a>
        </td>
        <td class="body" valign="top">Data_Text</td>
    </tr>
 </table>

 <table>
    <tr>
        <th>
            <div id="Data">Data</div>
        </th>
        <td>Data_Text_1</td>
        <td>Data_Text_2</td>
    </tr>
 </table>

 <table>
    <tr>
        <td width="120">
            <a href="example" target="_blank">DATA</a>
        </td>
        <td>Data_Text</td>
    </tr>
 </table>
</html>

the XPath expression is evaluated and the selected text nodes are output twice: first as the direct result of the evaluation, where they appear concatenated, and then one per line, each surrounded by quotes:

DATAData_TextDataData_Text_1Data_Text_2DATAData_Text

. . . . . . .

"DATA"

"Data_Text"

"Data"

"Data_Text_1"

"Data_Text_2"

"DATA"

"Data_Text"

Upvotes: 0

Nicolás Ozimica

Reputation: 9758

Given an HTML document called xpathTables.html like this:

<html>
  <body>
    <table>
      <tbody>
        <tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>
      </tbody> 
    </table>

    <table>
      <tbody>
        <tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>
      </tbody>
    </table>

    <table>
      <tbody>
        <tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>
      </tbody>
    </table>
  </body>
</html>

And this PHP script:

<?php

$link = "xpathTables.html";

$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tables = $doc->getElementsByTagName('table');

$nodes  = $xpath->query('.//tbody/tr/td/a/b', $tables->item(0));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td[@class="body"]', $tables->item(0));
var_dump($nodes->item(1)->nodeValue);

$nodes  = $xpath->query('.//tbody/tr/th/div[@id="Data"]', $tables->item(1));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(1));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(1));
var_dump($nodes->item(1)->nodeValue);

$nodes  = $xpath->query('.//tbody/tr/td/a', $tables->item(2));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(2));
var_dump($nodes->item(1)->nodeValue);

You get this output:

string(4) "DATA"
string(9) "Data_Text"
string(4) "Data"
string(11) "Data_Text_1"
string(11) "Data_Text_2"
string(4) "DATA"
string(9) "Data_Text"

I didn't fully understand your question, so I wrote this example to show all the text nodes in your tables. If you are only interested in some of those nodes, pick the XPath queries that select them.

I included the table and tbody tags just to make the example more HTML-like.
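If the goal is to keep the values of each row together rather than query each table separately, a generic loop may be simpler. The following is a minimal sketch, assuming the same three tables as in xpathTables.html (inlined here so the example is self-contained); `$rows` and `$cells` are illustrative names.

```php
<?php
// The three sample tables, inlined instead of being read from xpathTables.html.
$html = <<<HTML
<html><body>
<table><tbody>
<tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>
</tbody></table>
<table><tbody>
<tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>
</tbody></table>
<table><tbody>
<tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>
</tbody></table>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// One array per row, holding the trimmed text of each th/td cell in order.
$rows = [];
foreach ($xpath->query('//tr') as $tr) {
    $cells = [];
    foreach ($xpath->query('./th | ./td', $tr) as $cell) {
        $cells[] = trim($cell->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Here `textContent` flattens any nested markup (such as the `a`, `b`, or `div` elements) into the cell's plain text, so the same loop handles all three table layouts.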

Upvotes: 0
