Reputation: 304
I've been looking through the existing questions and now have a better idea of my problem, but I still haven't found an answer.
I have a problem with regular expressions in PHP. I'm trying to get all the text in "alt" attributes of an HTML file. I'm taking into account all the possible tag names (img, input and area) and all kinds of eventualities, like spaces and line breaks in between the characters (like <img alt = "Hello">). It must also handle the fact that the matched string can be enclosed in single or double quotes and contain other (different) quote marks inside, for example: <img alt="Alan's picture"> or <img alt='Example for the word "hello" in the text'>.
This is becoming difficult for me (I'm a beginner with regular expressions), so I'll just show you what I've got. Note that I'm trying to use a backreference inside a character class, which I've found to be bad practice (or so I think).
'/<\s*(?:img|input|area)\s[^>]*alt\s*=\s*("|\')([^\1>]*)\1[^>]*>/siU'
I've also seen some people on StackOverflow recommending HTML parsers for things like this, but I'm worried about how many resources that approach may consume. Do you think it is a better idea? Thank you!
Upvotes: 0
Views: 686
Reputation: 1410
Absolutely you should use a parser. There are several reasons for this:
alt='why can't I do this'
alt="why the long space"
You can check out the StackOverflow question "Robust, Mature HTML Parser for PHP" for some suggestions about which parsers are worth using.
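As a rough illustration of the parser approach, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The function name extractAltTexts and the sample markup are made up for this example, and it assumes the DOM extension is enabled and the input is ASCII or declares its own charset. It is one possible way to do this, not the only one.

```php
<?php
// Hypothetical helper: collect the alt text of every img, input and area
// element, letting the HTML parser deal with quoting and whitespace.
function extractAltTexts(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often invalid; collect libxml warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $alts = [];
    // Select only the three element types that carry an alt attribute.
    foreach ($xpath->query('//img[@alt] | //input[@alt] | //area[@alt]') as $node) {
        $alts[] = $node->getAttribute('alt');
    }
    return $alts;
}

// Sample input taken from the cases in the question.
$html = <<<'HTML'
<html><body>
<img alt = "Hello" src="a.png">
<img alt="Alan's picture" src="b.png">
<img alt='Example for the word "hello" in the text' src="c.png">
</body></html>
HTML;

print_r(extractAltTexts($html));
```

The quoting and spacing cases from the question are handled by the parser itself, so no special treatment of single versus double quotes is needed. As for resources, parsing a typical page this way is usually cheap, but it is worth measuring on your own files if they are very large.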
Upvotes: 0
Reputation: 647
Using a parser is definitely the way to go.
Regular expressions are highly inappropriate for this type of task, and even Jon Skeet cannot parse HTML using regular expressions.
Upvotes: 2