Reputation: 23

Finding absence of words in a regular expression

I've seen examples of finding the absence of characters in a regular expression, but I'm trying to find the absence of whole words (likely using a negative lookbehind).

I have lines of code like this:

Example One:

protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";

And here's another one:

mainWindow.Id = "MainWindow";

Final one:

mainStoLabel.Text = "#stb_entry_clah";

I want to capture only the middle one by finding all strings like these that a.) don't start with a "#" in the actual string between the quotes, and b.) aren't preceded anywhere on the line by the word "readonly".

My current Regular Expression is this:

.*\W\=\W"[^#].*"

It captures the top two examples. Now I just want to exclude the top example. How do I match on the absence of whole words rather than individual characters?

Thanks.

Upvotes: 2

Answers (4)

mousio

Reputation: 10337

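^[^"=]*(?<!(^|\s)readonly\s.*)\s*=\s*"[^#].*" seems to fit your needs (note that it relies on an unbounded lookbehind, which .NET supports but Perl, PCRE, and Python's re module do not):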

everything before the first equal sign should not contain readonly or quotes
readonly is recognized not with word boundaries but with whitespace (except at beginning of line)
the equal sign can be surrounded by arbitrary whitespace
the equal sign must be followed by a quoted string
the quoted string should not start with #

You can work with lookarounds or capture groups if you only want the strings or quoted strings.

Note: as per your own regex, this discards anything after the last quote (it does not match the semicolon in your examples).
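
For engines without unbounded lookbehind, the same constraints can be approximated with a negative lookahead anchored at the start of the line. The following is a minimal Python sketch, not a port of the pattern above: the name PATTERN and the rewritten lookahead are assumptions made for illustration, and only the three sample lines from the question are used.

```python
import re

# Assumed rewrite: the "no readonly before the equals sign" check is moved
# into a negative lookahead at the start of the line.
PATTERN = re.compile(r'''
    ^
    (?! [^"=]* (?:^|\s) readonly \s )   # reject a whitespace-delimited readonly
    [^"=]*                              # left-hand side: no quotes, no equals
    \s* = \s*                           # equals sign with optional whitespace
    " [^#] .* "                         # quoted string whose first char is not a hash
''', re.VERBOSE)

LINES = [
    'protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";',
    'mainWindow.Id = "MainWindow";',
    'mainStoLabel.Text = "#stb_entry_clah";',
]

matches = [line for line in LINES if PATTERN.search(line)]
print(matches)  # only the mainWindow.Id line is kept
```

As with the original pattern, the match stops at the last quote, so the trailing semicolon is not part of it.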

Upvotes: 1

tchrist

Reputation: 80384

The bug in your negative lookahead assertion is that it isn't constructed to handle the general case. The assertion has to apply at every character position as you crawl ahead. The way you've written it, it applies to only one possible dot, whereas you need it to apply to all of them. The demo below shows how to do this correctly.
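
To make the difference concrete, here is a small Python sketch contrasting a lookahead that is checked once with one that is re-checked before every character. The two pattern names and the shortened sample line are assumptions made for illustration; they are not taken from the question.

```python
import re

line = 'protected static readonly string NAME = "value";'

# Lookahead checked once, at the start of the line only.
checked_once = re.compile(r'^(?!\breadonly\b).*=')

# Lookahead re-checked before every character that gets consumed.
checked_everywhere = re.compile(r'^(?:(?!\breadonly\b).)+=')

print(bool(checked_once.search(line)))        # True: readonly is not at position 0
print(bool(checked_everywhere.search(line)))  # False: readonly blocks the crawl
```

The first pattern only rejects lines that begin with readonly, while the second cannot advance past the word wherever it occurs before the equals sign.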

Here is a working demo that shows two different approaches:

The first uses a negative lookahead to ensure that the left-hand portion does not contain readonly, and a negated character class to ensure that the right-hand portion does not start with a number sign.
The second uses a simpler pattern to parse the line, then separately inspects the left- and right-hand sides for the individual constraints that apply to each.

The demo language is Perl, but the same patterns and logic should work virtually everywhere.

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    chomp;
#
# First demo: use a complicated regex to get desired part only
#
    my($label) = m{
        ^                           # start at the beginning
        (?:                         # noncapture group:
            (?! \b readonly \b )    #   no "readonly" here
            .                       #   now advance one character
        ) +                         # repeated 1 or more times
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^#"] [^"]* ) "         # capture #1: quote-delimited text
                                    #   BUT whose first char isn't a "#"
    }x;

    if (defined $label) {
        print "Demo One: found label <$label> at line $.\n";
    }
#
# Second demo: This time use simpler patterns, several
#
    my($lhs, $rhs) = m{
        ^                       # from the start of line
        ( [^=]+ )               # capture #1: 1 or more non-equals chars
        \s* = \s*               # skip an equals sign w/optional spaces
        " ( [^"]+ ) "           # capture #2: all quote-delimited text
    }x;

    # Report only when the parse succeeded and neither constraint is violated
    if (defined $rhs && $lhs !~ /\b readonly \b/x && $rhs !~ /^#/) {
        print "Demo Two: found label <$rhs> at line $.\n";
    }

}
__END__
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
mainWindow.Id = "MainWindow";
mainStoLabel.Text = "#stb_entry_clah";

I have two bits of advice. The first is to make very sure you ALWAYS use /x mode so you can produce documented and maintainable regexes. The second is that it is much cleaner to do things a bit at a time, as in the second solution, rather than all at once, as in the first.
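
For readers working outside Perl, the second approach carries over fairly directly. Below is a minimal Python sketch, assuming re.VERBOSE as the counterpart of /x; the names ASSIGNMENT, HAS_READONLY, and extract_label are hypothetical and chosen only for this example.

```python
import re

# Simple parse: split the line into a left-hand side and a quoted right-hand side.
ASSIGNMENT = re.compile(r'''
    ^
    ( [^=]+ )        # group 1: one or more non-equals characters
    \s* = \s*        # equals sign with optional whitespace
    " ( [^"]+ ) "    # group 2: quote-delimited text
''', re.VERBOSE)

HAS_READONLY = re.compile(r'\breadonly\b')

def extract_label(line):
    """Return the quoted text, or None if the line is rejected."""
    m = ASSIGNMENT.search(line)
    if m is None:
        return None                      # not an assignment of a quoted string
    lhs, rhs = m.groups()
    if HAS_READONLY.search(lhs) or rhs.startswith('#'):
        return None                      # one of the two constraints is violated
    return rhs

print(extract_label('mainWindow.Id = "MainWindow";'))  # MainWindow
```

Each constraint sits in its own line of code here, so adding or dropping a rule does not require touching the parsing pattern.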

Upvotes: 2