Reputation: 5096
I'm using PHP Simple HTML DOM to parse a web page with the following HTML. Notice the extra </span> tags in each <li>.
<li>
<span class="name">
<a href="">Link</a> asdasd
</span>
</span>
</li>
<li>
<span class="name">
<a href="">Link</a> asdasd2
</span>
</span>
</li>
My queries are:
$lis = $dom->find('li');
foreach ($lis as $li) {
    $spans = $li->find('span');
    foreach ($spans as $span) {
        echo $span->plaintext."<br>";
    }
}
My output is:
Link asdasd
Link asdasd2
-----------
Link asdasd2
-----------
As you can see, find('span') finds two spans as children of the first <li>, taking the second value from the next <span> it can find (even though that one is a child of the next <li>). Removing the trailing </span> fixes the problem.
My questions are:
Why is this happening?
How can I solve this particular case? Everything else works well and I'm not in a position to make big changes to my script. I can easily change the DOM queries if needed, though.
I am thinking about counting opening and closing tags and stripping one </span> if there are too many of them. Since they will always be <span>s, is there a smart way to check this with a regexp?
Upvotes: 0
Views: 339
Reputation: 54984
1) Simple HTML DOM is trying to fix your extra </span> by adding a <span> somewhere. So now you have an extra span that shouldn't be there. For the record, DOMDocument would do the same thing, although perhaps in a more predictable way.
2) Simplify:
foreach ($dom->find('li > span') as $span) {
    echo $span->plaintext."<br>";
}
// Link asdasd <br> Link asdasd2 <br>
Now you've told it you only want the span that is a child of a li. Even better, do something like:
foreach ($dom->find('span.name') as $span) {
    echo $span->plaintext."<br>";
}
Use those attributes, that's what they're good for.
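For comparison, here is a minimal sketch of the same attribute-based lookup using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The surrounding <ul> wrapper and the variable names are assumptions made for illustration. Selecting on the class attribute means any span the parser might invent while repairing the markup is not picked up, since it would carry no class.

```php
<?php
// Sample markup from the question, wrapped in a <ul> (the wrapper is an assumption).
$html = <<<HTML
<ul>
<li>
<span class="name">
<a href="">Link</a> asdasd
</span>
</span>
</li>
<li>
<span class="name">
<a href="">Link</a> asdasd2
</span>
</span>
</li>
</ul>
HTML;

// Collect parser warnings about the stray </span> tags instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// XPath counterpart of the 'span.name' selector (exact class match only).
$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query('//span[@class="name"]') as $span) {
    // Collapse runs of whitespace so each entry is a single clean line.
    $names[] = trim(preg_replace('/\s+/', ' ', $span->textContent));
}

foreach ($names as $name) {
    echo $name . "<br>";
}
```

Note that the XPath expression matches only when the class attribute is exactly "name"; an element with several classes would need a contains()-style test instead.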
Upvotes: 1
Reputation: 2195
$newTxt = preg_replace('/<\/span>\s*<\/span>/', '</span>', $txt);
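If legitimately nested spans might also appear, a counting approach like the one proposed in the question is a bit safer than collapsing every pair of closing tags. The following is only a rough sketch: stripUnmatchedSpanClosers is a hypothetical helper, not part of Simple HTML DOM, and it assumes the markup is available as a string in $txt.

```php
<?php
// Hypothetical helper: removes every </span> that has no matching opening <span>.
function stripUnmatchedSpanClosers(string $html): string
{
    $depth = 0; // number of <span> elements currently open

    return preg_replace_callback(
        '/<span\b[^>]*>|<\/span\s*>/i',
        function (array $m) use (&$depth): string {
            if ($m[0][1] !== '/') {
                // Opening tag: remember that one more span is open.
                $depth++;
                return $m[0];
            }
            if ($depth > 0) {
                // Closing tag that pairs with an open span: keep it.
                $depth--;
                return $m[0];
            }
            // Stray closing tag: drop it.
            return '';
        },
        $html
    );
}

// One <li> from the question, with the extra </span>.
$txt = "<li>\n<span class=\"name\">\n<a href=\"\">Link</a> asdasd\n</span>\n</span>\n</li>";
$clean = stripUnmatchedSpanClosers($txt);
echo $clean;
```

The cleaned string can then be handed to the parser, for example with Simple HTML DOM's str_get_html($clean), before running the find() queries.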
The method 'find(x)' is an overloaded function that can return the equivalents of:
$e->getElementById(x);
$e->getElementsById(x);
$e->getElementByTagName(x); and
$e->getElementsByTagName(x);
Your first call, $dom->find('li'), makes use of the last of these. The second, $li->find('span'), seems to make use of the third. The library is probably choosing which lookup to run as an optimization, based on what you asked for through the API. I guess you have found a bug in the API, because in both cases you were asking for the third call:
$e->getElementByTagName();
Upvotes: 1