Akerus

Reputation: 381

regex substring not containing words

I'm looking for a regex that matches strings from multiple lines that do not include certain words/characters.

In my case, it is for refactoring HTML template files. I have to remove inline styles except when they contain display:none; or a $TEMPLATE_VARIABLE. For this, I'm trying to use the regex search-and-replace function in Netbeans.

My first attempt was the following:

    style="[^"(?!\$)]*"

(Regex Test 1) This matches all style declarations that do not include template variables, but unfortunately it also matches those containing display:none.

After some research I came up with the following:

    style="(?!display\s*:\s*none)[^"(?!\$)]*"

(Regex Test 2) This works until something in the style declaration precedes the display:none style.
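
To make that failure concrete, here is a minimal Python sketch. It is an illustration only: Netbeans uses Java's regex engine, but the constructs in this pattern should behave the same way in Python's `re`. The helper name `matches_attempt_2` is made up for this example.

```python
import re

# The second attempt, copied verbatim from above.
# Note: inside [...] the characters ( ? ! $ ) are ordinary literals,
# so "(?!\$)" in the character class is not a lookahead.
ATTEMPT_2 = re.compile(r'style="(?!display\s*:\s*none)[^"(?!\$)]*"')

def matches_attempt_2(attr: str) -> bool:
    """Return True if the pattern finds a match anywhere in attr."""
    return ATTEMPT_2.search(attr) is not None

# The lookahead is only evaluated right after the opening quote,
# so it catches display:none at the start of the value...
print(matches_attempt_2('style="display: none;"'))                # False
# ...but not when another declaration comes first.
print(matches_attempt_2('style="color: green; display: none;"'))  # True
```

The lookahead needs to be able to scan the whole attribute value, not just its first characters.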

Trying different approaches with negative lookbehinds and lookaheads did not result in success. For example:

    style="(?!.*(\$|display)).*"

(Regex Test 3) This seemed to work at first glance but has two problems. First, other HTML attributes that follow a style definition are matched together with it. Second, if a template variable is used anywhere after the style definition, that style is not matched at all.

Does anyone have an idea how the regex has to look so that it turns this

    <span style="border: 1px solid red">Test</span>
    <form style="border: 1px solid black" method="POST">
        <span style="color:red; $TEMPLATE_VARIABLE"><span style="background-color:blue;" >Test</span>Test</span>
        <div style="display: none;">
            <span style="color: green; display: none;">Test</span>
            <span style="display: inline-block">Test $NOT_STYLING_TEMPLATE_VARIABLE</span>
        </div>
    </form>

into this?

    <span>Test</span>
    <form method="POST">
        <span style="color:red; $TEMPLATE_VARIABLE"><span>Test</span>Test</span>
        <div style="display: none;">
            <span style="color: green; display: none;">Test</span>
            <span>Test $NOT_STYLING_TEMPLATE_VARIABLE</span>
        </div>
    </form>

The remaining stylings where display:none or template variables are used will be cleaned by hand.

Thanks in advance!

Upvotes: 2

Views: 36

Answers (1)

ctwheels

Reputation: 22817

Brief

You shouldn't be using regex to parse HTML, but I'll answer with a regex anyway since that is what you asked for and you haven't specified any other language.

Also, I'd suggest changing \$ in the regex to \$\w+, since a[href$=".pdf"] is valid CSS and a bare \$ could accidentally match something like that (I'm not sure how it would end up in a style attribute, but I'm sure you can be creative). It is a cheap preventative measure.
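
As a rough illustration of the difference (a Python sketch, not part of the Netbeans workflow; the variable names `loose` and `strict` are made up here), a bare `\$` matches any dollar sign, while `\$\w+` requires at least one word character after it:

```python
import re

loose = re.compile(r'\$')      # any "$"
strict = re.compile(r'\$\w+')  # "$" followed by one or more word characters

template_value = 'color:red; $TEMPLATE_VARIABLE'
css_selector = 'a[href$=".pdf"]'

print(bool(loose.search(template_value)), bool(strict.search(template_value)))  # True True
print(bool(loose.search(css_selector)), bool(strict.search(css_selector)))      # True False
```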

P.S. Your regex was very close. In regex, . matches any character (except a newline by default), so it was also matching the closing ". I've changed it to [^"] to keep the match inside the attribute value.
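
A small Python sketch of that difference, using the third attempt from the question and the same pattern with `.` swapped for `[^"]`. The names `dot_pattern` and `no_quote_pattern` and the second sample line are invented for the illustration; the behaviour should carry over to Java's engine.

```python
import re

# Third attempt from the question: "." can run past the closing quote.
dot_pattern = re.compile(r'style="(?!.*(\$|display)).*"')
# Same idea, but both the lookahead and the match stop at the next quote.
no_quote_pattern = re.compile(r'style="(?![^"]*(\$|display))[^"]*"')

line1 = '<form style="border: 1px solid black" method="POST">'
print(dot_pattern.search(line1).group(0))
# style="border: 1px solid black" method="POST"
print(no_quote_pattern.search(line1).group(0))
# style="border: 1px solid black"

# A "$" after the attribute makes the "." version reject the style entirely.
line2 = '<span style="color:blue">Test $VAR</span>'
print(dot_pattern.search(line2))                # None
print(no_quote_pattern.search(line2).group(0))  # style="color:blue"
```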


Code


    \s*style="(?![^"]*(\$|display:\s*none))[^"]*"(?:\s*(?=>))?

Results

Input

<span style="border: 1px solid red">Test</span>
<form style="border: 1px solid black" method="POST">
    <span style="color:red; $TEMPLATE_VARIABLE"><span style="background-color:blue;" >Test</span>Test</span>
    <div style="display: none;">
        <span style="color: green; display: none;">Test</span>
        <span style="display: inline-block">Test $NOT_STYLING_TEMPLATE_VARIABLE</span>
    </div>
</form>

Output

<span>Test</span>
<form method="POST">
    <span style="color:red; $TEMPLATE_VARIABLE"><span>Test</span>Test</span>
    <div style="display: none;">
        <span style="color: green; display: none;">Test</span>
        <span>Test $NOT_STYLING_TEMPLATE_VARIABLE</span>
    </div>
</form>

Explanation

  • \s* Match any whitespace character, zero or more times
  • style=" Match this string literally
  • (?![^"]*(\$|display:\s*none)) Negative lookahead: fail the match if the text that follows matches this sub-pattern
    • [^"]* Match any character except ", zero or more times
    • (\$|display:\s*none) Match either of the following
      • \$ Match $ literally
      • display:\s*none Match display: literally, followed by zero or more whitespace characters, followed by none literally
  • [^"]* Match any character except ", zero or more times
  • " Match " literally
  • (?:\s*(?=>))? Optionally match any trailing whitespace, but only if the positive lookahead succeeds (the next character is >). This removes the extra whitespace when no other attributes follow.
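
The same breakdown can be written as a commented pattern. This is a sketch in Python using `re.VERBOSE` (which ignores unescaped whitespace and `#` comments inside the pattern); the name `STYLE_RE_VERBOSE` is made up, and the pattern itself is the one from the Code section.

```python
import re

STYLE_RE_VERBOSE = re.compile(r'''
    \s*                          # leading whitespace before the attribute
    style="                      # literal start of the attribute
    (?!                          # negative lookahead: reject if, inside the value,
        [^"]*                    #   some run of non-quote characters
        (\$|display:\s*none)     #   is followed by "$" or "display: none"
    )
    [^"]*                        # the attribute value itself
    "                            # closing quote
    (?:\s*(?=>))?                # trailing whitespace, only when ">" comes next
''', re.VERBOSE)

print(STYLE_RE_VERBOSE.sub('', '<span style="color:blue;" >x</span>'))  # <span>x</span>
print(STYLE_RE_VERBOSE.sub('', '<div style="display: none;">'))         # <div style="display: none;">
```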

Upvotes: 5
