Imbuter

Reputation: 17

Regular expression for checking attributes in an HTML tag

In an HTML source I need to extract any simple text inside a FONT tag with exactly (no more, no less) these 3 attributes, in any order: size=5, color="red", face="verdana".

For example, the regular expression must extract "randomtext" from every line below except the last four.

<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

I solved the "in any order" problem by using 3 look-aheads:

<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>

...or, to tolerate more flexible HTML:

<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>

The problem is that it also matches the last three. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without codifying all possible positive combinations and without using literal negative expressions that work only on my examples.)
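To make the problem concrete, here is a minimal sketch that runs the first pattern above against the ten sample lines. Python is an assumption here (the question does not name a language), and the names `LOOKAHEAD_ONLY`, `SAMPLES` and `matched` are only illustrative.

```python
import re

# The first look-ahead pattern from the question, copied verbatim.
LOOKAHEAD_ONLY = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    r'[^>]*>([^<]+)</font>'
)

# The ten sample lines from the question, in the same order.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]

# True where the pattern matches, False where it does not.
matched = [bool(LOOKAHEAD_ONLY.search(s)) for s in SAMPLES]

for sample, ok in zip(SAMPLES, matched):
    print("match   " if ok else "no match", sample)
```

The look-aheads only assert that each attribute is present somewhere in the tag, and the trailing `[^>]*` then consumes whatever else is there, which is why the extra attributes slip through.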

Upvotes: 0

Views: 797

Answers (2)

Birei

Reputation: 36262

Here is one way, although I agree with those who say that a regexp is not the right tool for this job:

Content of script.pl (with the regexp inside and explained):

use warnings;
use strict;

while ( <DATA> ) {
    printf qq[Text matched: %s\t (original string: %s)\n], $1, $& if 
    m/ 
        # At the beginning of the line: '<' character plus optional whitespace.
        \A < \s*
        # Literal word 'font'.
        font
        # Mandatory whitespace.
        \s+
        # Positive look-ahead for 'size=5'; [^>]* keeps the search inside the tag.
        (?= [^>]* size \s* = \s* 5 (?:\s+|>) )   
        # Positive look-ahead for 'face="verdana"'.
        (?= [^>]* face \s* = \s* "verdana" (?:\s+|>) )
        # Positive look-ahead for 'color="red"'.
        (?= [^>]* color \s* = \s* "red" (?:\s+|>) )
        # The look-aheads guarantee that all three attributes are present, so
        # exactly three matches of this alternation mean that each attribute
        # occurs once and nothing else is in the tag.
        (?:size\s*=\s*5\s*|color\s*=\s*"red"\s*|face\s*=\s*"verdana"\s*){3}
        # Literal '>' character.
        >
        # Text between the tags (anything up to the next '<').
        ([^<]+)
        # Closing tag and end of string.
        <\/font> \Z
    /x;
}

__DATA__
<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

Run it like:

perl script.pl

With the following result:

Text matched: randomtext         (original string: <font size=5 color="red" face="verdana">randomtext</font>)
Text matched: randomtext         (original string: <font size=5 face="verdana" color="red">randomtext</font>)
Text matched: randomtext         (original string: <font color="red" size=5 face="verdana">randomtext</font>)
Text matched: randomtext         (original string: <font color="red" face="verdana" size=5>randomtext</font>)
Text matched: randomtext         (original string: <font face="verdana" size=5 color="red">randomtext</font>)
Text matched: randomtext         (original string: <font face="verdana" color="red" size=5>randomtext</font>)
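For readers not using Perl, the same idea (three look-aheads to require each attribute, then exactly three repetitions of an alternation to forbid anything else) can be sketched in Python. This is a hedged port, not the script above: the names `EXACTLY_THREE` and `extract` are illustrative, the look-aheads use `[^>]*` so they stay inside the tag, and `\s* \Z` is used because Python's `\Z` does not tolerate a trailing newline the way Perl's does.

```python
import re

EXACTLY_THREE = re.compile(
    r'''
    \A < \s* font \s+
    # Each look-ahead requires one attribute somewhere inside the tag.
    (?= [^>]* \b size  \s*=\s* 5         (?:\s|>) )
    (?= [^>]* \b face  \s*=\s* "verdana" (?:\s|>) )
    (?= [^>]* \b color \s*=\s* "red"     (?:\s|>) )
    # Exactly three attributes from the allowed set, then the tag must close.
    (?: (?: size\s*=\s*5 | color\s*=\s*"red" | face\s*=\s*"verdana" ) \s* ){3}
    > ([^<]+) </font> \s* \Z
    ''',
    re.VERBOSE,
)

def extract(line):
    """Return the text inside the tag, or None if the tag is not accepted."""
    m = EXACTLY_THREE.match(line)
    return m.group(1) if m else None

# The ten sample lines from the question, in the same order.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]

results = [extract(s) for s in SAMPLES]
for sample, text in zip(SAMPLES, results):
    print(repr(text), sample)
```

Because all three attributes are required and only three slots are available, a duplicated attribute necessarily pushes out one of the required ones, so the look-aheads reject it.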

Upvotes: 1

stema

Reputation: 92976

As you can see, this is getting difficult. If you have an alternative to regex, use it!
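As one possible alternative, here is a minimal sketch that uses Python's standard `html.parser` module instead of a regex. Python itself is an assumption (the question names no language), and `FontTextExtractor`, `WANTED` and `texts` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# The three required attributes; HTMLParser lower-cases names and strips quotes.
WANTED = {"size": "5", "color": "red", "face": "verdana"}

class FontTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.texts = []
        # One flag per open <font>: True if it has exactly the wanted attributes.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            # The length check rejects duplicates; the dict comparison rejects
            # extra, missing, valueless or differently valued attributes.
            self._stack.append(len(attrs) == len(WANTED) and dict(attrs) == WANTED)

    def handle_endtag(self, tag):
        if tag == "font" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Keep text only when the innermost open <font> was accepted.
        if self._stack and self._stack[-1] and data.strip():
            self.texts.append(data.strip())

# The ten sample lines from the question, in the same order.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]

parser = FontTextExtractor()
parser.feed("\n".join(SAMPLES))
parser.close()
texts = parser.texts
print(texts)
```

The attribute check becomes a plain comparison, so ordering, quoting style and spacing around `=` no longer need to be encoded in a pattern.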

For regex try this:

<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")(?![^>]*(?<!color|size|face)=)(?:\s+[^>\s=]+=[^>\s=]+\s*)+>([^<]+)</font>

See it here on Regexr

I added/changed two things:

  1. (?![^>]*(?<!color|size|face)=) is a negative lookahead with a nested negative lookbehind. It rejects the tag if any equal sign inside it is not directly preceded by color, size or face.

  2. I replaced the [^>]* that matched the attributes with (?:\s+[^>\s=]+=[^>\s=]+\s*)+, so that every attribute must have the form name=value, where the name and the value contain no whitespace, no equal sign and no >. This rejects bare words such as garbagetext.
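A hedged sketch of how this pattern could be tried in Python follows. Python's `re` module requires fixed-width look-behinds, so `(?<!color|size|face)` is split here into three consecutive look-behinds, which is an adaptation and not the original pattern. The names `ADAPTED`, `results` and `duplicate_matches` are illustrative.

```python
import re

ADAPTED = re.compile(
    r'<font'
    # All three attributes must be present somewhere in the tag.
    r'(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    # No '=' in the tag may follow anything other than color, size or face.
    r'(?![^>]*(?<!color)(?<!size)(?<!face)=)'
    # Every attribute must look like name=value (this rejects bare words).
    r'(?:\s+[^>\s=]+=[^>\s=]+\s*)+'
    r'>([^<]+)</font>'
)

# The ten sample lines from the question, in the same order.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]

results = []
for sample in SAMPLES:
    m = ADAPTED.search(sample)
    results.append(m.group(1) if m else None)
    print(repr(results[-1]), sample)

# An extra case that is not among the question's samples: a repeated attribute.
duplicate = '<font size=5 color="red" face="verdana" size=5>randomtext</font>'
duplicate_matches = ADAPTED.search(duplicate) is not None
print("duplicate attribute accepted:", duplicate_matches)
```

Note that a tag repeating one of the three allowed attributes still passes, because nothing in the pattern counts the attributes. If "no more, no less" has to cover duplicates as well, an additional check is needed.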

Upvotes: 0
