Regular expressions: matching all alt attributes in an HTML file?

Question

I've been looking through the existing questions and have a better idea of my problem, but I still haven't found an answer.

I have a problem with regular expressions in PHP. I'm trying to get all the text in "alt" attributes of an HTML file. I'm taking into account all the possible tag names (img, input and area) and all kinds of eventualities, like spaces and line breaks in between the characters (like ). The pattern must also handle a match string that is enclosed in single or double quotes and contains other (different) quote marks inside, for example: or, .

This is becoming difficult for me (I'm a beginner with regular expressions), so I'll just show you what I've got. Note that I'm trying to use a backreference inside a character class, which I've found to be bad practice (or so I think).

'/<\s*(?:img|input|area)\s[^>]*alt\s*=\s*("|\')([^\1>]*)\1[^>]*>/siU'

I've also seen some people on StackOverflow recommending HTML parsers for stuff like this, but I'm worried about how many resources this approach may consume. Do you think this is a better idea? Thank you!

Kurt McKee · Accepted Answer

You should absolutely use a parser. There are several reasons for this:

An HTML parser library can account for broken (or otherwise malformed) HTML that a regular expression will miss; for instance, some webpages will fail to escape quotes embedded in the alt attribute, such as alt='why can't I do this'
Parsers will be able to handle escaped characters automatically; for instance, alt="why the long space"
Additionally, it's probable that an HTML parser will offer speed and API advantages

You can perhaps check out the StackOverflow question Robust, Mature HTML Parser for PHP for some suggestions about what parsers would be worthwhile to use.
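
As a rough illustration of the parser approach, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than any particular library from that question. The function name extractAltTexts and the sample markup are made up for this example, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper (not from the original answer): collect alt text
// from img, input and area elements using PHP's bundled DOM extension.
function extractAltTexts(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings
    // internally instead of emitting PHP warnings, then restore the
    // previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $alts = [];
    // Select the alt attribute nodes on the three tag names from the question.
    foreach ($xpath->query('//img/@alt | //input/@alt | //area/@alt') as $attr) {
        $alts[] = $attr->nodeValue;
    }
    return $alts;
}

// Sample markup (made up): spaces around "=", a line break inside a tag,
// and a different kind of quote mark inside each attribute value.
$html = <<<HTML
<p><img src="a.png" alt = 'He said "hi"'>
<input type="image" src="b.png"
   alt="It's here"></p>
HTML;

print_r(extractAltTexts($html));
```

The whitespace, line breaks and mixed quote styles that the regular expression struggles with are handled by the parser itself, so the XPath query only has to name the elements and the attribute. Note that loadHTML makes its own assumptions about character encoding, so non-ASCII input may need extra handling.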
