Cloudkiller

Reputation: 1636

Regular expression to not match something in quotes

I use this regex in a PHP preg_replace call to strip the whitespace that follows a ":" or a "(":

([\(:])\s+

The problem is that it also strips spaces inside quotes, which I need to keep. For example, the space after the colon in this string must be preserved:

img[style*="float: left"]

Is there a way to write the regex so it will match any ":" or "(" unless it is enclosed in double quotes?
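
A minimal reproduction of the problem, assuming the regex is applied with preg_replace and '$1' as the replacement (the question does not show the actual call):

```php
<?php
// The pattern from the question: a ":" or "(" followed by whitespace.
$pattern = '/([\(:])\s+/';

$input = 'img[style*="float: left"]';

// Replacing each match with the captured character drops the whitespace,
// including the space inside the quoted attribute value.
$result = preg_replace($pattern, '$1', $input);

echo $result, PHP_EOL; // img[style*="float:left"]
```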

Upvotes: 3

Views: 224

Answers (3)

Ro Yo Mi

Reputation: 15010

Description

This routine will:

  • skip the matches found inside the quotes
  • replace matches found outside the quotes


Code

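<?php

$string = 'img[style*="float: left"]
img: [style*="float: left"]
img( [style*="float: left"]
';

// Alternative 1 matches a whole quoted run; alternative 2 captures ":" or "("
// followed by whitespace. Group 1 exists only when alternative 2 matched.
$regex = '/"[^"]*"|([:(])\s+/';

$output = preg_replace_callback(
    $regex,
    function ($matches) {
        // Outside quotes: keep the captured character, drop the whitespace.
        if (array_key_exists(1, $matches)) {
            return $matches[1];
        }
        // Inside quotes: return the quoted run unchanged.
        return $matches[0];
    },
    $string
);
echo "this is the output:" . $output;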

Output

this is the output:img[style*="float: left"]
img:[style*="float: left"]
img([style*="float: left"]
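
The same skip-then-replace idea can also be sketched without a callback by using PCRE's backtracking control verbs (*SKIP)(*F). This is a different technique from the callback above, and the function name is hypothetical. Like the callback version, it assumes quotes are balanced and never escaped:

```php
<?php
// Hypothetical helper name; not part of the answer above.
function stripSpacesOutsideQuotes(string $text): string
{
    // "[^"]*"(*SKIP)(*F)  consume a quoted run, then force a failure and
    //                     skip past it, so nothing inside it can match
    // ([:(])\s+           otherwise match ":" or "(" plus trailing whitespace
    return preg_replace('/"[^"]*"(*SKIP)(*F)|([:(])\s+/', '$1', $text);
}

echo stripSpacesOutsideQuotes('img: [style*="float: left"]'), PHP_EOL;
// img:[style*="float: left"]
```

Because quoted runs are never part of a successful match, the replacement string only ever sees the second alternative, so '$1' is always set.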

Upvotes: 1

Casimir et Hippolyte

Reputation: 89639

You can try this:

$text = preg_replace('~(?|(\\\{2}|\\\"|"(?>[^"\\\]+|\\\{2}|\\\")*+")|([:(])\s+)~', '$1', $text);

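The idea is to match double-quoted parts before trying ([:(])\s+ and to replace them with themselves.

To avoid treating escaped quotes as delimiters, backslash sequences are matched first.

Pattern details (backslashes are shown as written in the PHP single-quoted string, so \\\{2} reaches the regex engine as \\{2}):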

~                                    # pattern delimiter
(?|                                  # branch reset: groups in each alternative share the same number
    (                                # open a capturing group
        \\\{2}                       # two backslashes (an escaped backslash; escapes nothing)
      |                              # OR
        \\\"                         # an escaped double quote
      |                              # OR
        "(?>[^"\\\]+|\\\{2}|\\\")*+" # content inside double quotes
    )                                # close the capturing group
  |                                  # OR
    ( [:(] )                         # a : or a ( in a capturing group
    \s+                              # spaces
)                                    # close the branch reset group
~                                    # pattern delimiter

This is useful for handling cases like these:

img: " : \" ( "
img: \" : ( " ( "
img: \\" : ( " ( "

Result:

img:" : \" ( "
img:\" :(" ( "
img:\\" : ( " ("
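
To check these cases locally, a small harness along the following lines should work. The wrapper function name is hypothetical; the pattern is copied verbatim from the answer, and the inputs are the three lines above:

```php
<?php
// Hypothetical wrapper name; the pattern is copied verbatim from the answer.
function stripWithEscapes(string $text): string
{
    $pattern = '~(?|(\\\{2}|\\\"|"(?>[^"\\\]+|\\\{2}|\\\")*+")|([:(])\s+)~';
    return preg_replace($pattern, '$1', $text);
}

// Nowdoc syntax keeps every backslash literal.
$input = <<<'TXT'
img: " : \" ( "
img: \" : ( " ( "
img: \\" : ( " ( "
TXT;

// Process line by line so a quoted run can never span two lines.
$lines = array_map('stripWithEscapes', explode("\n", $input));
echo implode("\n", $lines), PHP_EOL;
```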

Upvotes: 1

Song Gao

Reputation: 666

There are two ways to go about this:

  1. You can use negative lookarounds to assert that there is no double quote before or after the character you want to leave alone. The problem I have with this is that the : or ( can be any distance from the quotes, and lookbehind assertions in PCRE cannot be of unbounded length.

  2. What I like to do is "preserve" anything enclosed in double quotes: match it with the regex \"[^"]+\", store the matches in an array, and replace each one with a placeholder string (I use "THIS_IS_A_QUOTE"). Once all the quotes are stored, strip the spaces, and finally replace each "THIS_IS_A_QUOTE" placeholder with the corresponding string from the array.
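
A minimal sketch of the second approach, under a few assumptions that are not in the answer: the function name is hypothetical, each placeholder carries an index so it can be restored unambiguously, and the quote pattern uses [^"]* instead of [^"]+ so that empty quotes ("") are stashed too. It also assumes the placeholder text never occurs in the input.

```php
<?php
// Hypothetical helper; the placeholder text follows the answer's suggestion,
// with an index appended (an assumption) so each quote can be restored.
function stripSpacesPreservingQuotes(string $text): string
{
    $quotes = [];

    // 1. Stash every double-quoted run and leave a numbered placeholder.
    $masked = preg_replace_callback(
        '/"[^"]*"/',
        function (array $m) use (&$quotes): string {
            $quotes[] = $m[0];
            return 'THIS_IS_A_QUOTE_' . (count($quotes) - 1) . '_';
        },
        $text
    );

    // 2. Strip whitespace after ":" and "(" in the masked text.
    $stripped = preg_replace('/([:(])\s+/', '$1', $masked);

    // 3. Restore the stashed quotes by index.
    return preg_replace_callback(
        '/THIS_IS_A_QUOTE_(\d+)_/',
        function (array $m) use ($quotes): string {
            return $quotes[(int) $m[1]];
        },
        $stripped
    );
}

echo stripSpacesPreservingQuotes('img: [style*="float: left"]'), PHP_EOL;
// img:[style*="float: left"]
```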

Upvotes: 1
