Artem

Reputation: 384

Regular expressions: find string without substring

I have a big text:

"Big piece of text. This sentence includes 'regexp' word. And this
sentence doesn't include that word"

I need to find a substring that starts with 'this' and ends with 'word' but doesn't include the word 'regexp'.

In this case, the string "this sentence doesn't include that word" is exactly what I want to get.

How can I do this via Regular Expressions?

Upvotes: 30

Views: 45364

Answers (2)

Igor Chubin

Reputation: 64563

Use lookahead assertions.

When you want to check if a string does not contain another substring, you can write:

/^(?!.*substring)/

You also need to anchor the match so that it starts with this and ends with word:

/^this(?!.*substring).*word$/
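To see what these two patterns accept, here is a minimal sketch in Python (the answer itself uses Perl; the lookahead syntax is the same in both). The sample strings are made up for illustration, and 'regexp' is substituted for the placeholder 'substring'.

```python
import re

# Patterns from the answer, with 'regexp' in place of the placeholder 'substring'.
no_regexp = re.compile(r"^(?!.*regexp)")
this_to_word = re.compile(r"^this(?!.*regexp).*word$")

# Hypothetical sample lines, not taken from the question.
samples = [
    "this sentence has no forbidden word",
    "this sentence includes regexp word",
    "that sentence ends with word",
]

for s in samples:
    # First column: line does not contain 'regexp'.
    # Second column: line starts with 'this', ends with 'word', no 'regexp' in between.
    print(bool(no_regexp.search(s)), bool(this_to_word.search(s)), s)
```

Only the first sample passes both checks: the second contains 'regexp', and the third does not start with 'this'.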

Another problem here is that you don't want to match whole strings; you want to match sentences (if I understand your task correctly).

So the solution looks like this:

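perl -e '
  local $/;                        # slurp mode: read the whole input at once
  $_=<>;
  while($_ =~ /(.*?[.])/g) {       # take one sentence (up to the next period) at a time
    $s=$1;
    print $s if $s =~ /^\s*this(?!.*substring).*word[.]$/i
  };'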

Example of usage:

$ cat 1.pl
local $/;
$_=<>;
while($_ =~ /(.*?[.])/g) {
    $s=$1;
    print $s if $s =~ /^\s*this(?!.*regexp).*word[.]/i;
};

$ cat 1.txt
This sentence has the "regexp" word. This sentence doesn't have the word. This sentence does have the "regexp" word again.

$ cat 1.txt | perl 1.pl 
 This sentence doesn't have the word.
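The same split-then-filter idea can be sketched in Python; this is an illustrative translation of the Perl script above, not part of the original answer, and the variable names are my own.

```python
import re

text = ('This sentence has the "regexp" word. '
        "This sentence doesn't have the word. "
        'This sentence does have the "regexp" word again.')

# Step 1: split into sentences, each running up to the next period.
sentences = re.findall(r".*?[.]", text)

# Step 2: keep sentences that start with 'this' (after optional whitespace),
# reach 'word.' and have no 'regexp' after 'this'.
keep = re.compile(r"^\s*this(?!.*regexp).*word[.]", re.IGNORECASE)
result = [s for s in sentences if keep.search(s)]
print(result)
```

As in the Perl output, the surviving sentence keeps its leading space, because the split leaves the space after the previous period attached to the next sentence.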

Upvotes: 13

Andrew Clark

Reputation: 208415

With an ignore case option, the following should work:

\bthis\b(?:(?!\bregexp\b).)*?\bword\b

Example: http://www.rubular.com/r/g6tYcOy8IT

Explanation:

\bthis\b           # match the word 'this', \b is for word boundaries
(?:                # start group, repeated zero or more times, as few as possible
   (?!\bregexp\b)    # fail if 'regexp' can be matched (negative lookahead)
   .                 # match any single character
)*?                # end group
\bword\b           # match 'word'

The \b surrounding each word makes sure that you aren't matching on substrings, like matching the 'this' in 'thistle', or the 'word' in 'wordy'.

This works by checking at each character between your start word and your end word to make sure that the excluded word doesn't occur.
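As a quick check outside Rubular, here is a minimal sketch that runs the same pattern in Python against the text from the question. The text is written on one line here; that is an assumption, since the question shows it wrapped.

```python
import re

text = ("Big piece of text. This sentence includes 'regexp' word. "
        "And this sentence doesn't include that word")

# Same pattern as above; re.IGNORECASE plays the role of the ignore-case option.
pattern = re.compile(r"\bthis\b(?:(?!\bregexp\b).)*?\bword\b", re.IGNORECASE)

matches = pattern.findall(text)
print(matches)  # ["this sentence doesn't include that word"]
```

The first 'This' is rejected because 'regexp' appears before any 'word' can be reached. If the text contains line breaks, `re.DOTALL` would also be needed so that `.` can match a newline.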

Upvotes: 49
