Reputation: 12993
I'm working with a small subset of mostly invalid HTML, and I need to extract a small piece of data. Given that most of the "markup" isn't valid, I don't think loading everything into a DOM is a good option. Moreover, it seems like a lot of overhead for this simple case.
Here's an example of the markup that I have:
(a bunch of invalid markup here with unclosed tags, etc.)
<TD><span>Something (random text here)</span></TD>
(a bunch more invalid markup here with more unclosed tags.)
The <TD><span>Something (random text here)</span></TD>
portion does not repeat itself anywhere in the document, so I believe a simple regex would do the trick.
However, I'm terrible with regular expressions.
Should I use a regular expression? Is there a simpler way to do this? If possible, I'd just like to extract the text after Something, i.e. the (random text here) portion.
Thanks in advance!
Edit -
Exact example of the HTML (I've omitted the stuff prior, which is the invalid markup that the vendor uses. It's irrelevant for this example, I believe):
<div class="FormTable">
<TABLE>
<TR>
<TD colspan="2">In order to proceed with login operation please
answer on the security question below</TD>
</TR>
<TR>
<TD colspan="2"> </TD>
</TR>
<TR>
<TD><label class="FormLabel">Security Question</label></TD>
<TD><span>What is your city of birth?</span></TD>
</TR>
<TR>
<TD><label class="FormLabel">Answer</label></TD>
<TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
</TR>
</TABLE>
</div>
Upvotes: 2
Views: 261
Reputation: 1
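Try using the DOMDocument::loadHTML() method. It tolerates malformed HTML and still builds a document from it. It reports the problems it finds as warnings, which you can silence by calling libxml_use_internal_errors(true) before loading.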
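As a rough sketch of that approach (not the only way to do it), the snippet below loads a shortened copy of the table from the question and pulls out the span text with XPath. The variable names and the `//td/span` query are illustrative assumptions; on the real page the query may need to be narrower, since other spans could exist in the vendor's markup.

```php
<?php
// Shortened copy of the markup from the question; one </TD> is left out
// on purpose to stand in for the vendor's invalid HTML.
$html = <<<'HTML'
<div class="FormTable">
<TABLE>
<TR>
<TD><label class="FormLabel">Security Question</label></TD>
<TD><span>What is your city of birth?</span></TD>
</TR>
<TR>
<TD><label class="FormLabel">Answer</label>
<TD><INPUT name="securityAnswer" class="input" type="password" value=""></TD>
</TR>
</TABLE>
</div>
HTML;

// Collect parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The HTML parser lowercases element names, so query for td/span.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td/span');

// Take the first match, or null when nothing was found.
$question = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $question, PHP_EOL;
```

Whether this is too much overhead for a single value is a judgment call; it does avoid hand-written pattern matching.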
Upvotes: 0
Reputation: 301
A DOM parser is not optimal in your situation. I strongly believe you need a SAX parser: it reads the document piece by piece and sends the appropriate events to your handlers as it goes. This approach makes it easy to parse broken documents.
Examples: http://pear.php.net/package/XML_HTMLSax3 http://www.php.net/manual/en/example.xml-structure.php
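To illustrate the event-handler idea only, here is a minimal sketch using PHP's built-in xml_* functions rather than the PEAR package linked above. One important assumption: the built-in parser is a strict XML parser and stops at the first malformed construct, so the sample fragment here is deliberately well-formed. For genuinely broken HTML, a tolerant SAX-style parser such as the linked XML_HTMLSax3 would be needed. The variable names are made up for the example.

```php
<?php
// A well-formed fragment; the strict XML parser would reject broken markup.
$fragment = '<TR>'
    . '<TD><label class="FormLabel">Security Question</label></TD>'
    . '<TD><span>What is your city of birth?</span></TD>'
    . '</TR>';

$inSpan = false; // true while the parser is inside a <span> element
$text   = '';    // accumulated character data from inside the span

$parser = xml_parser_create();

// Element names arrive uppercased because case folding is on by default.
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$inSpan) {
        if ($name === 'SPAN') {
            $inSpan = true;
        }
    },
    function ($parser, $name) use (&$inSpan) {
        if ($name === 'SPAN') {
            $inSpan = false;
        }
    }
);

// Character data may be delivered in several chunks, so append each one.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inSpan, &$text) {
        if ($inSpan) {
            $text .= $data;
        }
    }
);

xml_parse($parser, $fragment, true);

echo $text, PHP_EOL;
```

The handlers never see the whole document at once; they only react to the start tag, the text, and the end tag as each one is reached.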
Upvotes: 1
Reputation: 95522
If you're sure the opening and closing span tags are on a single line . . .
$ cat test.php
<?php
$subject = "(a bunch of invalid markup here with unclosed tags, etc.)
<TD><span>Something (random text here)</span></TD>
(a bunch more invalid markup here with more unclosed tags.)";
$pattern = '/<span>.*<\/span>/';
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
$ php -f test.php
Array
(
[0] => <span>Something (random text here)</span>
)
If you're not confident that the span tags are on the same line, you can treat the HTML as a plain text file and grep for the opening and closing span tags.
$ grep '[</]span>' yourfile.html
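A possible variant of the PHP pattern above, sketched under the assumption that the literal word Something always precedes the wanted text: a capture group returns only the part after it, the lazy `.*?` stops at the first closing tag, and the `s` modifier lets `.` match newlines so the span may wrap across lines. The sample input is the placeholder text from the question with a line break added for illustration.

```php
<?php
$subject = "(a bunch of invalid markup here with unclosed tags, etc.)
<TD><span>Something (random
text here)</span></TD>
(a bunch more invalid markup here with more unclosed tags.)";

// i: match <SPAN> or <span>; s: let . match newlines as well.
// Group 1 captures everything between "Something" and </span>.
$pattern = '/<span>\s*Something\s*(.*?)\s*<\/span>/is';

$text = null;
if (preg_match($pattern, $subject, $matches)) {
    $text = $matches[1];
}

var_dump($text);
```

If the span in the real document carries attributes, the `<span>` part of the pattern would need loosening, for example to `<span[^>]*>`.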
Upvotes: 2
Reputation: 80384
You might read through this answer and the other two it cites. Tackling invalid HTML a bit at a time is actually something you're likely to have better luck with using regexes than using full parsers.
Upvotes: 1