Reputation: 69
I'd just like to say up front that regex is a suitable solution for this problem: the HTML being parsed is, and always will be, formatted the same way.
The particular piece of HTML I am interested in parsing looks similar to the following:
<a href="" target="" onCick=""><img style="" onmouseover="" onmouseout="" src="" alt="" /></a>
I am interested in pulling the values of the 'src' and 'alt' attributes out of that string. Regex confuses me to the point that I don't really understand what I'm doing with it, so real help would be appreciated. It would mean a lot, thanks.
Upvotes: 1
Views: 79
Reputation: 168966
Which language are you using? Regexp dialects have some minor differences.
Either way, for JavaScript you could use
var match = /src="(.*?)"\s+alt="(.*?)"/.exec(pieceOfHTML);
// match[1] should be the src, match[2] the alt
or for Python,
import re
match = re.search(r'src="(.*?)"\s+alt="(.*?)"', pieceOfHTML)
# match.group(1) and match.group(2) respectively
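To see the Python version run end to end, here is a minimal sketch. The sample string and the helper name extract_src_alt are made up for illustration; only the tag layout comes from the question. It also checks for a failed match, since re.search returns None when the pattern is not found.

```python
import re

# Hypothetical sample input: same tag layout as in the question,
# but the attribute values are invented for this example.
piece_of_html = (
    '<a href="page.html" target="_blank" onCick="go()">'
    '<img style="border:0" onmouseover="a()" onmouseout="b()" '
    'src="images/logo.png" alt="Company logo" /></a>'
)

def extract_src_alt(html):
    """Return (src, alt) from the first match, or None if nothing matches."""
    match = re.search(r'src="(.*?)"\s+alt="(.*?)"', html)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(extract_src_alt(piece_of_html))  # ('images/logo.png', 'Company logo')
```

Note that this relies on src coming directly before alt, as it does in the question's markup; with the attributes in another order the function returns None.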
EDIT re comments:
<a href=".*?"\s+target=".*?"\s+onCick=".*?"><img style=".*?"\s+onmouseover=".*?"\s+onmouseout=".*?"\s+src="(.*?)"\s+alt="(.*?)"
should be a decent regexp to match only the pattern required, with lenience regarding whitespace.
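As a rough sketch of how the stricter pattern might be used in Python (written here with \s+ between every pair of attributes), the snippet below pulls every src/alt pair out of a larger string. The sample page and the names STRICT_PATTERN and extract_all are assumptions for illustration, not part of the question.

```python
import re

# Stricter pattern: only matches an <img> inside the exact <a> layout
# from the question. Group 1 is src, group 2 is alt.
STRICT_PATTERN = re.compile(
    r'<a href=".*?"\s+target=".*?"\s+onCick=".*?"><img style=".*?"\s+'
    r'onmouseover=".*?"\s+onmouseout=".*?"\s+src="(.*?)"\s+alt="(.*?)"'
)

def extract_all(html):
    """Return a list of (src, alt) tuples, one per matching anchor."""
    return STRICT_PATTERN.findall(html)

# Hypothetical input: two anchors in the expected layout plus one bare
# <img> that should be ignored because it lacks the surrounding <a>.
page = "\n".join([
    '<a href="one.html" target="_blank" onCick="f()"><img style="border:0" '
    'onmouseover="x()" onmouseout="y()" src="a.png" alt="First" /></a>',
    '<a href="two.html" target="_blank" onCick="g()"><img style="border:0" '
    'onmouseover="x()" onmouseout="y()" src="b.png" alt="Second" /></a>',
    '<img src="other.png" alt="ignored" />',
])

print(extract_all(page))  # [('a.png', 'First'), ('b.png', 'Second')]
```

Because . does not match a newline by default, a lazy .*? cannot run from one line into the next here; if several anchors share one line, replacing .*? with [^"]* would keep each piece inside its own pair of quotes.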
Upvotes: 1