Auzette

Reputation: 23

Finding absence of words in a regular expression

I've seen examples of finding the absence of characters in a regular expression, but I'm trying to find the absence of whole words (likely using a negative lookbehind).

I have lines of code like this:

Example One:

protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";

And here's another one:

mainWindow.Id = "MainWindow";

Final one:

mainStoLabel.Text = "#stb_entry_clah";

I want to capture only the middle one by finding all strings like these that a) don't start with a "#" inside the quotes, and b) aren't preceded anywhere on the line by the word "readonly".

My current Regular Expression is this:

.*\W\=\W"[^#].*"

It captures the top two examples. Now I just want to exclude the top example. How do I match on the absence of whole words rather than characters?

Thanks.

Upvotes: 2

Views: 5343

Answers (4)

mousio

Reputation: 10337

^[^"=]*(?<!(^|\s)readonly\s.*)\s*=\s*"[^#].*" seems to fit your needs (it relies on variable-length lookbehind, which .NET supports but many other regex engines do not):

  • everything before the first equal sign should not contain readonly or quotes
  • readonly is recognized not with word boundaries but with whitespace (except at beginning of line)
  • the equal sign can be surrounded by arbitrary whitespace
  • the equal sign must be followed by a quoted string
  • the quoted string should not start with #

You can work with lookarounds or capture groups if you only want the strings or quoted strings.

Note: as with your own regex, this discards anything after the last quote (so it does not match the semicolon in your examples).
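
As a rough sketch of the bullet points above in Python: Python's built-in re module does not accept variable-width lookbehind, so this version swaps in a negative lookahead anchored at the start of the line. That swap, the name find_assignments, and the sample list are assumptions made for illustration; this is not the regex from the answer itself.

```python
import re

SAMPLES = [
    'protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";',
    'mainWindow.Id = "MainWindow";',
    'mainStoLabel.Text = "#stb_entry_clah";',
]

# Assumption: lookahead-based rewrite of the lookbehind idea.
PATTERN = re.compile(r'''
    ^
    (?! [^"=]* (?<!\S) readonly \s )   # no whitespace-delimited "readonly" before the first equals sign or quote
    [^"=]* = \s*                       # left-hand side, equals sign, optional whitespace
    " [^#] .* "                        # quoted string that does not start with a number sign
''', re.VERBOSE)


def find_assignments(lines):
    """Return the lines that satisfy all of the constraints above."""
    return [line for line in lines if PATTERN.search(line)]


print(find_assignments(SAMPLES))
```

With the three sample lines from the question, only the mainWindow.Id line is returned.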

Upvotes: 1

tchrist

Reputation: 80384

The bug in your negative lookahead assertion is that it isn't constructed to handle the general case. The assertion has to apply at every character position as you crawl ahead. The way you've written it, it applies at only one position, whereas it must hold at all of them. See below for how to do this correctly.

Here is a working demo that shows two different approaches:

  1. The first uses a negative lookahead to ensure that the left-hand portion does not contain readonly and that the right-hand portion does not start with a number sign.

  2. The second uses a simpler parse, then separately inspects the left- and right-hand sides for the individual constraints that apply to each.

The demo language is Perl, but the same patterns and logic should work virtually everywhere.

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    chomp;
#
# First demo: use a complicated regex to get desired part only
#
    my($label) = m{
        ^                           # start at the beginning
        (?:                         # noncapture group:
            (?! \b readonly \b )    #   no "readonly" here
            .                       #   now advance one character
        ) +                         # repeated 1 or more times
        \s* = \s*                   # skip an equals sign w/optional spaces
        " ( [^#"] [^"]* ) "         # capture #1: quote-delimited text
                                    #   BUT whose first char isn't a "#"
    }x;

    if (defined $label) {
        print "Demo One: found label <$label> at line $.\n";
    }
#
# Second demo: This time use simpler patterns, several
#
    my($lhs, $rhs) = m{
        ^                       # from the start of line
        ( [^=]+ )               # capture #1: 1 or more non-equals chars
        \s* = \s*               # skip an equals sign w/optional spaces
        " ( [^"]+ ) "           # capture #2: all quote-delimited text
    }x;

    if (defined $lhs && $lhs !~ /\b readonly \b/x && $rhs !~ /^#/) {
        print "Demo Two: found label <$rhs> at line $.\n";
    }

}
__END__
protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";
mainWindow.Id = "MainWindow";
mainStoLabel.Text = "#stb_entry_clah";

I have two bits of advice. The first is to make very sure you ALWAYS use /x mode so you can produce documented and maintainable regexes. The second is that it is much cleaner doing things a bit at a time as in the second solution rather than all at once as in the first.
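
For readers not working in Perl, here is a rough Python port of both approaches, using re.VERBOSE as the counterpart of /x. The function names label_one_step and label_two_step are hypothetical, chosen only for this sketch.

```python
import re

SAMPLES = [
    'protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";',
    'mainWindow.Id = "MainWindow";',
    'mainStoLabel.Text = "#stb_entry_clah";',
]

# First approach: one regex, with the lookahead re-checked at every position.
ONE_STEP = re.compile(r'''
    ^
    (?: (?! \b readonly \b ) . )+   # advance one char at a time, never onto "readonly"
    \s* = \s*                       # equals sign with optional spaces
    " ( [^#"] [^"]* ) "             # group 1: quoted text not starting with a number sign
''', re.VERBOSE)

# Second approach: a simple split, then separate checks on each side.
TWO_STEP = re.compile(r'''
    ^
    ( [^=]+ )                       # group 1: everything before the equals sign
    \s* = \s*
    " ( [^"]+ ) "                   # group 2: all quoted text
''', re.VERBOSE)


def label_one_step(line):
    m = ONE_STEP.search(line)
    return m.group(1) if m else None


def label_two_step(line):
    m = TWO_STEP.search(line)
    if m is None:
        return None
    lhs, rhs = m.groups()
    if re.search(r'\breadonly\b', lhs) or rhs.startswith('#'):
        return None
    return rhs


for line in SAMPLES:
    print(label_one_step(line), label_two_step(line))
```

Both functions return MainWindow for the second sample line and None for the other two.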

Upvotes: 2

sehe

Reputation: 393064

You absolutely need to specify the language. The negative lookahead/lookbehind is the thing you need.

Look at this site for an inventory of how to do that in Delphi, GNU (Linux), Groovy, Java, JavaScript, .NET, PCRE (C/C++), Perl, PHP, POSIX, PowerShell, Python, R, REALbasic, Ruby, Tcl, VBScript, Visual Basic 6, wxWidgets, XML Schema, XQuery & XPath.

Upvotes: 0

stema

Reputation: 92986

I don't fully understand your question, but a negative lookahead would look like this:

^(?!.*readonly)(?:.*\s\=\s"[^#].*")

The first part, anchored at the start of the line, succeeds only if the word "readonly" does not appear anywhere in the string.
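
A small Python sketch of this lookahead, assuming the three sample lines from the question. It compares the pattern with and without a leading ^, because an unanchored search can start just past the "r" of "readonly", where the lookahead no longer sees the whole word. The helper name hits is hypothetical.

```python
import re

SAMPLES = [
    'protected static readonly string BACKGROUND_MUSIC_NAME = "Music_Mission_Complete_Loop_audio";',
    'mainWindow.Id = "MainWindow";',
    'mainStoLabel.Text = "#stb_entry_clah";',
]

# Lookahead plus the assignment pattern (a plain = is equivalent to \=).
BODY = r'(?!.*readonly)(?:.*\s=\s"[^#].*")'

anchored = re.compile('^' + BODY)
unanchored = re.compile(BODY)


def hits(pattern):
    """Return the 1-based numbers of the sample lines the pattern finds."""
    return [n for n, line in enumerate(SAMPLES, start=1) if pattern.search(line)]


print(hits(anchored))    # [2]
print(hits(unanchored))  # [1, 2]
```

Using re.match instead of re.search would have the same effect as the ^ here, since re.match only tries the start of the string.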

Which language are you using?

Did I understand correctly that you want to match only the second example?

Upvotes: 2
