Mattis

Reputation: 5096

PHP Simple HTML DOM parser gives faulty data

I'm using PHP Simple HTML DOM to parse a web page with the following HTML. Notice the extra </span> tag in each <li>.

<li>
  <span class="name">
    <a href="">Link</a> asdasd
  </span>
  </span>
</li>
<li>
  <span class="name">
    <a href="">Link</a> asdasd2
  </span>
  </span>
</li>

My queries are:

$lis = $dom->find('li');
foreach ($lis as $li) {
  $spans = $li->find('span');
  foreach ($spans as $span) {
    echo $span->plaintext."<br>";
  }
}

My output is:

Link asdasd 
Link asdasd2
-----------
Link asdasd2 
-----------

As you can see, find('span') finds two spans as children of the first <li>, and for the second one it takes the value from the next <span> it can find (even though that span is a child of the next <li>). Removing the trailing </span> fixes the problem.

My questions are:

  1. Why is this happening?

  2. How can I solve this particular case? Everything else works well and I'm not in a position to make big changes to my script. I can easily change the DOM queries if needed, though.

I am thinking about counting opening and closing tags and stripping one </span> if there are too many of them. Since they will always be <span>s, is there a smart way to check this with a regexp?

Upvotes: 0

Views: 339

Answers (2)

pguardiario

Reputation: 54984

1) Simple is trying to fix your extra </span> by adding a <span> somewhere. So now you have an extra span that shouldn't be there. For the record, DomDocument would do the same thing, although perhaps in a more predictable way.

2) Simplify:

foreach ($dom->find('li > span') as $span) {
  echo $span->plaintext."<br>";
}
//     Link asdasd    <br>     Link asdasd2    <br>

Now you've told it you only want the span that is a child of a li. Even better, do something like:

foreach ($dom->find('span.name') as $span) {
  echo $span->plaintext."<br>";
}

Use those attributes, that's what they're good for.
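Since DOMDocument came up above, here is a minimal sketch of the same "span directly under li" query using PHP's built-in DOMDocument and DOMXPath. The wrapping <ul> and the $html variable are assumptions added to keep the snippet self-contained, and how the stray </span> gets repaired is worth checking against your real page rather than taking on faith.

```php
<?php
// Sample markup from the question, wrapped in a <ul> (an assumption) so it is a complete list.
$html = '<ul>'
      . '<li><span class="name"><a href="">Link</a> asdasd</span></span></li>'
      . '<li><span class="name"><a href="">Link</a> asdasd2</span></span></li>'
      . '</ul>';

// Collect parser warnings about the malformed markup instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select only spans with class "name" that are direct children of an <li>.
$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query('//li/span[@class="name"]') as $span) {
    $names[] = trim($span->textContent);
}

print_r($names);
```

The XPath expression plays the same role as 'li > span' and 'span.name' combined: it ties the match to both the parent element and the class attribute.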

Upvotes: 1

Loek Bergman

Reputation: 2195

// \s* (not [\S]*) so the newline and indentation between the two closing tags are matched
$newTxt = preg_replace('/<\/span>\s*<\/span>/', '</span>', $txt);
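A pairwise regex only helps when the stray tag sits right after a real closing tag. A minimal sketch of the counting idea from the question, assuming the markup is available as a string in $txt before it is handed to the parser. stripUnmatchedClosingSpans is a hypothetical helper name, not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: removes every </span> that has no open <span> before it.
function stripUnmatchedClosingSpans(string $html): string
{
    $depth = 0; // number of <span> elements currently open

    return preg_replace_callback(
        '~<(/?)span\b[^>]*>~i',
        function (array $m) use (&$depth): string {
            if ($m[1] === '') {   // opening <span ...>
                $depth++;
                return $m[0];
            }
            if ($depth > 0) {     // closing tag that matches an open span
                $depth--;
                return $m[0];
            }
            return '';            // stray closing tag: drop it
        },
        $html
    );
}

// Sample input taken from the question, collapsed onto single lines.
$txt = '<li><span class="name"><a href="">Link</a> asdasd</span></span></li>'
     . '<li><span class="name"><a href="">Link</a> asdasd2</span></span></li>';

$clean = stripUnmatchedClosingSpans($txt);
```

The cleaned string can then be passed to the parser in place of the original markup. Nested spans keep their matching closing tags, because only a closing tag seen at depth 0 is removed.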

The method 'find(x)' is an overloaded function that can return the equivalents of:

$e->getElementById(x);
$e->getElementsById(x);
$e->getElementByTagName(x); and
$e->getElementsByTagName(x);

Your first call, find('li'), behaves like the last of these. For the second $li, find('span') behaves like the third. Which lookup the library runs for a given query is probably decided by some optimization in the API. I guess you have found a bug in the API, because in both cases you were asking for the third call:

$e->getElementByTagName();

Upvotes: 1
