Reputation: 17
In an HTML source I need to extract any simple text inside a FONT tag with exactly (no more, no less) these 3 attributes, in any order: size=5, color="red", face="verdana".
So, for example, the regular expression must extract every "randomtext" below except the last four.
<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>
I solved the "in any order" problem by using 3 look-aheads:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>
...or for more html flexibility:
<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>
The problem is that it also matches the last three. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without codifying all possible positive combinations and without using literal negative expressions that work only on my examples.)
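A quick way to reproduce the problem is to run the first pattern over a few of the sample tags. The Python sketch below is only an illustration: the question does not name a regex flavor, and the names `LOOSE_RE`, `SAMPLES` and `accepted` are made up here. It shows which tags the look-ahead-only pattern accepts.

```python
import re

# The first pattern from the question, unchanged.
LOOSE_RE = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    r'[^>]*>([^<]+)</font>'
)

# Hypothetical subset of the sample tags: one wanted, three unwanted.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]

# True where the pattern accepts the tag.
accepted = [LOOSE_RE.search(tag) is not None for tag in SAMPLES]

for tag, ok in zip(SAMPLES, accepted):
    print("match   " if ok else "no match", tag)
```

The trailing `[^>]*` is what lets the extra attributes through: the look-aheads only require the three attributes to be present, not that nothing else is.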
Upvotes: 0
Views: 797
Reputation: 36262
Here is one way, although I agree with those who say that a regexp is not the right tool for the job.
Content of script.pl (with the regexp inside and explained):
use warnings;
use strict;

while ( <DATA> ) {
    printf qq[Text matched: %s\t (original string: %s)\n], $1, $& if
        m/
            # At beginning of line, '<' character plus optional space.
            \A < \s*
            # Literal 'font' word.
            font
            # Mandatory space.
            \s+
            # Positive look-ahead for string 'size=5' (kept inside the tag by [^>]*).
            (?= [^>]* size \s* = \s* 5 (?:\s+|>) )
            # Positive look-ahead for string 'face="verdana"'
            (?= [^>]* face \s* = \s* "verdana" (?:\s+|>) )
            # Positive look-ahead for string 'color="red"'
            (?= [^>]* color \s* = \s* "red" (?:\s+|>) )
            # If the three look-aheads succeed, consume exactly three attributes.
            (?:size\s*=\s*5\s*|color\s*=\s*"red"\s*|face\s*=\s*"verdana"\s*){3}
            # Literal '>' character.
            >
            # Text between tags.
            ([^<]+)
            # Close tag and match end of string.
            <\/font> \Z
        /x;
}
__DATA__
<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>
Run it like:
perl script.pl
With the following result:
Text matched: randomtext (original string: <font size=5 color="red" face="verdana">randomtext</font>)
Text matched: randomtext (original string: <font size=5 face="verdana" color="red">randomtext</font>)
Text matched: randomtext (original string: <font color="red" size=5 face="verdana">randomtext</font>)
Text matched: randomtext (original string: <font color="red" face="verdana" size=5>randomtext</font>)
Text matched: randomtext (original string: <font face="verdana" size=5 color="red">randomtext</font>)
Text matched: randomtext (original string: <font face="verdana" color="red" size=5>randomtext</font>)
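The same idea, three look-aheads followed by exactly three attributes drawn from the allowed list, carries over to other regex flavors. Below is a minimal Python sketch of it, not a translation of the script above: the names `FONT_RE`, `extract_texts` and `SAMPLE` are invented for illustration, and it scans a whole document with `finditer` instead of anchoring each line.

```python
import re

FONT_RE = re.compile(
    r'''
    <\s*font
    # Each look-ahead demands one attribute somewhere inside the tag.
    (?=[^>]*\ssize\s*=\s*5[\s>])
    (?=[^>]*\scolor\s*=\s*"red"[\s>])
    (?=[^>]*\sface\s*=\s*"verdana"[\s>])
    # Exactly three attributes, each taken from the allowed list.
    (?:\s+(?:size\s*=\s*5|color\s*=\s*"red"|face\s*=\s*"verdana")){3}
    \s*>
    ([^<]+)        # text between the tags
    </font>
    ''',
    re.VERBOSE,
)


def extract_texts(html):
    """Return the text of every font tag with exactly the three attributes."""
    return [m.group(1) for m in FONT_RE.finditer(html)]


# The ten sample tags from the question.
SAMPLE = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
])

print(extract_texts(SAMPLE))
```

Because the look-aheads force one of each attribute and the group repeats exactly three times, a repeated or extra attribute leaves no way to reach the closing `>`.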
Upvotes: 1
Reputation: 92976
As you can see, this gets difficult quickly. If you have an alternative to a regex, use it!
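One alternative to a regex is a real HTML parser, which hands over the attributes already split into name/value pairs, so that "exactly these three" becomes a plain comparison. A minimal sketch using Python's standard `html.parser` module follows. The class name `FontTextExtractor` and the `WANTED` table are assumptions made for this example, and it assumes the font tags contain only plain text, not nested tags.

```python
from html.parser import HTMLParser

# The three required attributes (hypothetical name for this sketch).
WANTED = {"size": "5", "color": "red", "face": "verdana"}


class FontTextExtractor(HTMLParser):
    """Collect the text of font tags that carry exactly the WANTED attributes."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs and keeps duplicates,
        # so the length check rejects repeated or extra attributes.
        if tag == "font":
            self.capturing = len(attrs) == 3 and dict(attrs) == WANTED

    def handle_endtag(self, tag):
        if tag == "font":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.texts.append(data)


SAMPLE = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
])

parser = FontTextExtractor()
parser.feed(SAMPLE)
parser.close()
print(parser.texts)
```

Attribute order and quoting style no longer matter here, since the parser normalizes both before the comparison.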
For regex try this:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")(?![^>]*(?<!color|size|face)=)(?:\s+[^>\s=]+=[^>\s=]+\s*)+>([^<]+)</font>
See it here on Regexr
I added/changed two things:
(?![^>]*(?<!color|size|face)=) is a negative lookahead with a nested negative lookbehind assertion. It rejects the tag if any equals sign before the closing > is not immediately preceded by color, size or face.
I changed your [^>]*, which matched the attributes, to (?:\s+[^>\s=]+=[^>\s=]+\s*)+ so that it only matches name=value pairs whose name and value contain no whitespace, > or = characters. A bare word such as garbagetext therefore no longer matches.
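Not every engine accepts alternatives of different lengths inside a lookbehind. Python's `re` module, for instance, requires fixed-width lookbehind. A hedged sketch of the same pattern for Python therefore splits `(?<!color|size|face)` into three separate lookbehinds, which is equivalent. The names `STRICT_RE`, `SAMPLES` and `accepted` are made up for this illustration.

```python
import re

STRICT_RE = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    # No "=" inside the tag may lack color, size or face right before it.
    r'(?![^>]*(?<!color)(?<!size)(?<!face)=)'
    # Every attribute must be a whitespace-free name=value pair.
    r'(?:\s+[^>\s=]+=[^>\s=]+\s*)+>([^<]+)</font>'
)

# Hypothetical samples; the last one repeats an allowed attribute name.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
    '<font size=5 color="red" face="verdana" size=7>randomtext</font>',
]

# True where the pattern accepts the tag.
accepted = [STRICT_RE.search(tag) is not None for tag in SAMPLES]

for tag, ok in zip(SAMPLES, accepted):
    print("match   " if ok else "no match", tag)
```

The last sample points at a remaining gap under the "exactly three" requirement: a repeated allowed attribute such as an extra size=7 still gets through, because the pattern checks attribute names but does not count them.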
Upvotes: 0