willdanceforfun

Reputation: 11240

Why does this simple_html_dom selector work when used in its entirety but not when broken into smaller selectors?

I'm having a go scraping a page with simple_html_dom. On the page I'm scraping, there's a table with rows, and inside those, a bunch of cells. I'm wanting to get stuff in the third cell in each row. The cell in question doesn't have a class.

<tr class="thisrow">
  <td class="firstcell"><strong>1st</strong></td>
  <td class="secondcell">nothing in here</td>
  <td><strong>blah blah</strong></td>
  <td>something else</td>
</tr>

So to get started, I went straight for the third cell:

foreach($html->find('tr.thisrow td:nth-child(3)') as $thirdcell) {
    echo $thirdcell->innertext; // this works, no problem!
}

But then I realised I needed some data in another cell in the row (td.firstcell). This cell has a class, so I thought best to loop through the rows, then use selectors within the context of that row:

foreach($html->find('tr.thisrow') as $row) {

    $thirdcell = $row->find('td:nth-child(3)');
    echo $thirdcell; // this is now empty

    $firstcell = $row->find('td.firstcell');
    echo $firstcell; // this works!

}

So as you can see, my nth-child selector suddenly stops working inside the context of the row loop. What am I missing?

Upvotes: 0

Views: 89

Answers (2)

trincot

Reputation: 350310

It is a limitation of simple html dom. Apparently it can deal with nth-child selectors, but only when the parent is in the tree below the node on which you apply find.

But it is a valid selector, as the equivalent JavaScript shows:

for (var row of [...document.querySelectorAll('tr.thisrow')]) {
    var thirdcell = row.querySelectorAll('td:nth-child(3)');
    console.log(thirdcell[0].textContent); // this works!
}

<table border=1>
<tr class="thisrow">
  <td class="firstcell"><strong>1st</strong></td>
  <td class="secondcell">nothing in here</td>
  <td><strong>blah blah</strong></td>
  <td>something else</td>
</tr>
</table>

As a workaround you could use the array index on the find('td') result:

foreach($html->find('tr.thisrow') as $row) {
    $thirdcell = $row->find('td');
    echo $thirdcell[2]; // this works
}

Or, alternatively with children, as td are direct children of tr:

foreach($html->find('tr.thisrow') as $row) {
    $thirdcell = $row->children();
    echo $thirdcell[2]; // this works
}
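Putting the workaround together with the original goal (reading both td.firstcell and the third cell of each row), a minimal sketch could look like the following. It assumes simple_html_dom.php sits next to the script, and it relies on the optional second argument of find(), which returns a single node (or null) instead of an array. The $markup and $pairs names are only for illustration.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory; adjust the path as needed.
require_once 'simple_html_dom.php';

// Sample markup copied from the question.
$markup = <<<HTML
<table>
<tr class="thisrow">
  <td class="firstcell"><strong>1st</strong></td>
  <td class="secondcell">nothing in here</td>
  <td><strong>blah blah</strong></td>
  <td>something else</td>
</tr>
</table>
HTML;

$html = str_get_html($markup);

$pairs = []; // hypothetical name: collects [first cell text, third cell text] per row
foreach ($html->find('tr.thisrow') as $row) {
    // With an index as second argument, find() returns one node rather than an array.
    $firstcell = $row->find('td.firstcell', 0);
    // Zero-based index: 2 is the third td found under this row.
    $thirdcell = $row->find('td', 2);

    // Skip rows that lack either cell instead of dereferencing null.
    if ($firstcell === null || $thirdcell === null) {
        continue;
    }

    $pairs[] = [trim($firstcell->plaintext), trim($thirdcell->plaintext)];
}

foreach ($pairs as [$first, $third]) {
    echo $first, ' => ', $third, PHP_EOL;
}
```

Note that find('td', 2) counts every td below the row, so if a cell contained a nested table, children(2) would probably be the safer choice.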

Upvotes: 2

Saeed M.

Reputation: 2361

You can use the children($int) method. $int is a zero-based index.

Try this:

$row = $html->find('tr.thisrow',0);

$firstcell = $row->children(0)->innertext; // index 0 is the first cell
$thirdcell = $row->children(2)->innertext; // index 2 is the third cell

There are also first_child(), last_child(), parent(), next_sibling() and prev_sibling().

Upvotes: 1
