PHP Simple HTML DOM parser gives faulty data

Question

I'm using PHP Simple HTML DOM to parse a web page with the following HTML. Notice the extra </span> tags in each <li>.


  
    Link asdasd
  
  


  
    Link asdasd2

My queries are:

$lis = $dom->find('li');
foreach ($lis as $li) {
  // Collect every span the parser considers a descendant of this li
  $spans = $li->find('span');
  foreach ($spans as $span) {
    echo $span->plaintext."\n";
  }
  echo "-----------\n";
}

My output is:

Link asdasd 
Link asdasd2
-----------
Link asdasd2 
-----------

As you can see, find('span') finds two spans as children of the first <li>, and it takes the value of the second from the next span it can find (even though that span is a child of the next <li>). Removing the trailing </span> fixes the problem.

My questions are:

1) Why is this happening?
2) How can I solve this particular case? Everything else works well and I'm not in a position to make big changes to my script. I can change the DOM queries easily though if needed.

I am thinking about counting opening and closing tags and stripping one if there are too many closing tags. Since they will always be </span>s, is there a smart way to check this with a regexp?

pguardiario · Accepted Answer

1) Simple HTML DOM is trying to fix your extra </span> by adding a <span> somewhere. So now you have an extra span that shouldn't be there. For the record, DOMDocument would do the same thing, although perhaps in a more predictable way.

2) Simplify:

foreach ($dom->find('li > span') as $span) {
  echo $span->plaintext."\n";
}
// Output:
//     Link asdasd
//     Link asdasd2

Now you've told it that you only want spans that are direct children of an li. Even better, do something like:

foreach ($dom->find('span.name') as $span) {
  echo $span->plaintext."\n";
}

Use those attributes, that's what they're good for.
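As for the idea in the question of counting tags and stripping the surplus closers with a regexp, here is a minimal sketch of how that could look. It is not part of Simple HTML DOM: the function name strip_unmatched_span_closers is made up for illustration, and the sample markup (including class="name") is only an assumption about what the page looks like. It walks through every <span ...> and </span> in order, keeps a running depth, and drops any </span> that arrives while no span is open. Being regex-based, it will also be fooled by spans inside comments, scripts, or attribute values, so treat it as a pre-cleaning step for known input rather than a general HTML repair.

```php
<?php
// Hypothetical helper (not a Simple HTML DOM function): removes every </span>
// that has no matching opening <span> before it.
function strip_unmatched_span_closers(string $html): string
{
    $depth = 0; // number of <span> elements currently open

    return preg_replace_callback(
        '~<span\b[^>]*>|</span\s*>~i',
        function (array $m) use (&$depth): string {
            $isCloser = $m[0][1] === '/';
            if (!$isCloser) {
                $depth++;      // opening tag: one more span is open
                return $m[0];
            }
            if ($depth > 0) {
                $depth--;      // closer matches an open span: keep it
                return $m[0];
            }
            return '';         // stray closer: drop it
        },
        $html
    );
}

// Assumed sample input: each li carries one surplus </span>.
$html = '<li><span class="name">Link asdasd</span></span></li>'
      . '<li><span class="name">Link asdasd2</span></span></li>';

echo strip_unmatched_span_closers($html), "\n";
```

You would run the raw page source through this before handing it to str_get_html(), so the parser never sees the stray closers and has nothing to "repair".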
