dmuk

Reputation: 485

Using PHP and xPath to extract clean table of text

I am using the code below to extract values from an HTML file. The code returns a single block of text. I want to know how to improve the code so that the elements of this block of text come out as a clean table.

File:

<div class=class1>
    <a href="">txt1</a>
            <div class=lvl2>
                    <p>hello1</p>
            </div>
    <a href="">txt2</a>
            <div class=lvl2>
                    <p>hello2</p>
            </div>
</div>

Code:

$doc = new DOMDocument();
@$doc->loadHTMLFile('file.htm');

$xpath = new DOMXPath($doc);

$list = $xpath->evaluate("//div[contains(@class, 'class1')]");

foreach ($list as $element)
    {
      echo '<p>' . $element->nodeValue . PHP_EOL . '</p>';
    }

Desired output:

 txt1 | hello1
 txt2 | hello2

Upvotes: 1

Views: 236

Answers (2)

Dan King

Reputation: 3580

Or, you could do it this way if you wanted to make sure you were outputting each table separately. It assumes the nodes come back in document order, which I don't think is strictly guaranteed with XML / XPath, but in practice most implementations do return them that way:

$doc = new DOMDocument();
$doc->loadHTMLFile('file.htm');

$xpath = new DOMXPath($doc);

$list = $xpath->evaluate("//div[contains(@class, 'class1')]");

foreach ($list as $element)
{
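    // The leading "." keeps each query relative to $element; a bare "//" searches the whole document.
    $column1 = $xpath->query(".//a", $element);
    $column2 = $xpath->query(".//div/p", $element);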

    for ($i = 0; $i < $column1->length; $i++) {
        echo $column1->item($i)->nodeValue . ' | ' . $column2->item($i)->nodeValue .  PHP_EOL;
    }
}

I've removed the @ error suppression from the loadHTMLFile call. I don't think you want it: if loading fails you will get errors later on anyway, and leaving it out makes the cause of the problem more explicit.
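
If the aim is to keep parser warnings out of the output without hiding real failures, one option is libxml's internal error buffer. This is only a sketch under my own assumptions: loadHtmlCollectingErrors is a hypothetical helper name, and it parses a string with loadHTML instead of loadHTMLFile so that the example is self-contained.

```php
<?php
// Hypothetical helper (not from the question): parse HTML and return
// [DOMDocument|null, list of parser messages] instead of emitting warnings.
function loadHtmlCollectingErrors(string $html): array
{
    // Buffer libxml errors internally; remember the previous setting to restore it.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);

    // Copy the buffered errors into plain strings.
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$ok ? $doc : null, $messages];
}

[$doc, $messages] = loadHtmlCollectingErrors('<div class=class1><a href="">txt1</a></div>');

if ($doc === null) {
    echo 'Could not parse the HTML' . PHP_EOL;
}
foreach ($messages as $message) {
    echo 'Parser message: ' . $message . PHP_EOL;
}
```

The same pattern should work around loadHTMLFile; the point is that failures stay visible through the return value and the collected messages.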

Amended: here's another way you could structure the loop if you don't want to iterate separately over both columns. It may still fail, though, if the number of rows in each column doesn't match in the HTML:

foreach ($list as $element)
{
    $column1 = $xpath->query(".//a", $element); // "." keeps the query relative to $element

    for ($i = 0; $i < $column1->length; $i++) {
        $field1 = $column1->item($i);
        $field2 = $xpath->query("following-sibling::div", $field1)->item(0);

        echo $field1->nodeValue . ' | ' . trim($field2->nodeValue) .  PHP_EOL;
    }
}
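
To guard against the mismatch case, one possibility is to pair each link only with the element that directly follows it, and leave the second column empty when that element is not a div. A minimal sketch, assuming the markup from the question; extractRows is a hypothetical helper name, and the extra txt3 link is made-up test data to show a missing row:

```php
<?php
// Hypothetical helper: pair each <a> with the element directly after it,
// but only when that element is a <div>.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $rows = [];
    foreach ($xpath->query("//div[contains(@class, 'class1')]/a") as $anchor) {
        // following-sibling::*[1] is the nearest following element;
        // [self::div] keeps it only if it is a <div>.
        $div = $xpath->query("following-sibling::*[1][self::div]", $anchor)->item(0);
        $rows[] = [trim($anchor->nodeValue), $div === null ? '' : trim($div->nodeValue)];
    }

    return $rows;
}

// Sample input: txt2 deliberately has no <div> of its own.
$html = '<div class=class1>
    <a href="">txt1</a>
    <div class=lvl2><p>hello1</p></div>
    <a href="">txt2</a>
    <a href="">txt3</a>
    <div class=lvl2><p>hello3</p></div>
</div>';

foreach (extractRows($html) as [$left, $right]) {
    echo $left . ' | ' . $right . PHP_EOL;
}
```

Returning the rows as an array keeps the extraction separate from the output, so the same data could be echoed with " | " as here or wrapped in table markup.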

Upvotes: 1

Dan King

Reputation: 3580

How about this?:

$doc = new DOMDocument();
@$doc->loadHTMLFile('file.htm');

$xpath = new DOMXPath($doc);

$list = $xpath->evaluate("//div[contains(@class, 'class1')]/a");

foreach ($list as $element)
{
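    // Skip whitespace text nodes until the next element sibling (or the end of the siblings).
    $nextElement = $element->nextSibling;
    while ($nextElement !== null && $nextElement->nodeType !== XML_ELEMENT_NODE) {
        $nextElement = $nextElement->nextSibling;
    }

    // If the <a> has no following element, leave the second column empty.
    $text = $nextElement === null ? '' : trim($nextElement->nodeValue);
    echo $element->nodeValue . ' | ' . $text . PHP_EOL;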
}

I wasn't quite sure why you wanted <p> as well as PHP_EOL, so I left those out, but you can put them back in where you need them.

Upvotes: 0
