Peter

Reputation: 31

Regex negation - word parsing

I am trying to parse a phrase and exclude common words.

For instance, in the phrase "as the world turns", I want to exclude the common words "as" and "the" and return only "world" and "turns".

(\w+(?!the|as))

This doesn't work. Feedback appreciated.

Upvotes: 3

Views: 317

Answers (2)

Mark Byers

Reputation: 838896

The lookahead should come first:

(\b(?!(the|as)\b)\w+\b)

I have also added word boundaries so that it only matches whole words. Otherwise it would fail to match the complete word "as", but it would still match the letter "s" of that word.
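As a rough illustration of the difference, here is a small Python sketch comparing the pattern from the question, a lookahead-first pattern without word boundaries, and the bounded pattern. The variable names are only for illustration, and the patterns use a non-capturing group so that re.findall returns plain strings:

```python
import re

phrase = "as the world turns"

# Pattern from the question: the lookahead runs after \w+ has consumed the
# word, so it only inspects what follows the word and rejects nothing here.
original = re.findall(r"\w+(?!the|as)", phrase)

# Lookahead first, but no word boundaries: the match simply starts one
# character later, inside the excluded word.
unanchored = re.findall(r"(?!the|as)\w+", phrase)

# Lookahead first, with word boundaries on both sides.
fixed = re.findall(r"\b(?!(?:the|as)\b)\w+\b", phrase)

print(original)    # ['as', 'the', 'world', 'turns']
print(unanchored)  # ['s', 'he', 'world', 'turns']
print(fixed)       # ['world', 'turns']
```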

You might also want to consider what \w matches and whether that meets your needs. If you are looking for words in English, you are probably interested in letters but not digits, and you may wish to include some punctuation characters that \w excludes, such as apostrophes. You could try something like this instead (Rubular):

/(\b(?!(?:the|as)\b)[a-z'-]+\b)/i
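A quick way to see the effect of the custom character class is to run it in Python, with re.IGNORECASE standing in for the /i flag. The sample sentence and the name WORD are made up for this sketch:

```python
import re

# Same idea as the pattern above; re.IGNORECASE plays the role of /i.
WORD = re.compile(r"\b(?!(?:the|as)\b)[a-z'-]+\b", re.IGNORECASE)

text = "The world doesn't turn as fast as a well-oiled top"
words = WORD.findall(text)
print(words)
# ['world', "doesn't", 'turn', 'fast', 'a', 'well-oiled', 'top']

# For comparison, \w+ splits a contraction at the apostrophe.
split_by_w = re.findall(r"\b(?!(?:the|as)\b)\w+\b", "doesn't")
print(split_by_w)  # ['doesn', 't']
```

Note that "The" is excluded despite the capital letter, because the case-insensitive flag applies to the lookahead as well.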

To match words more accurately in a human language, you could consider using a natural language parsing library instead of regular expressions.

Upvotes: 2

Gumbo

Reputation: 655599

You should use word boundaries to match only whole words, either with a look-ahead assertion:

(\b(?!(?:the|as)\b)\w+\b)

Or with a look-behind assertion:

(\b\w+\b(?<!\b(?:the|as)))
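Whether the look-behind version works depends on the regex engine, since "the" and "as" have different lengths. As one example, Python's built-in re module only accepts fixed-width look-behinds, so the pattern above is rejected there. A possible workaround, sketched here under that assumption, is one look-behind per excluded word:

```python
import re

# Python's re rejects a look-behind whose alternatives differ in length.
try:
    re.compile(r"\b\w+\b(?<!\b(?:the|as))")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: one fixed-width look-behind per excluded word. The \b inside
# each look-behind keeps words that merely end in "the" or "as".
pattern = re.compile(r"\b\w+\b(?<!\bthe)(?<!\bas)")

print(variable_width_ok)                      # False
print(pattern.findall("as the world turns"))  # ['world', 'turns']
print(pattern.findall("bathe as the gas"))    # ['bathe', 'gas']
```

The look-ahead form avoids this restriction, which makes it the more portable of the two.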

Upvotes: 1
