Kosman

Reputation: 13

Match all characters except for certain words

I have been learning regex (while implementing it in Python) for the past couple of days and haven't figured out how to solve this problem.

I have text in this format:

FOO1 = BAR2 AND Var1
Gene3 = Gene4 >= 3
Kinase = MATH OR NOT Science
BOOP = 3

I would like to identify each variable name (e.g. FOO1, BAR2, BOOP) and ignore any of the logical operators (e.g. AND, OR, NOT).

Here is my attempt at a solution: (?!AND)(?!OR)(?!NOT)([a-zA-Z0-9]+)

I am having trouble telling the look-behinds to recognize AND, OR, NOT as words rather than a set of individual characters.

Any help would be appreciated. Thanks in advance!

Upvotes: 1

Views: 1172

Answers (2)

HamZa

Reputation: 14921

First of all, thanks for showing your attempts. Second, let's try to improve your regex in several ways:

  1. You've got some nice lookaheads, which can be simplified to: (?!AND|OR|NOT)([a-zA-Z0-9]+)

  2. We don't really need a capturing group: (?!AND|OR|NOT)[a-zA-Z0-9]+

  3. Let's add a word boundary to prevent partial matching: (?!AND|OR|NOT)\b[a-zA-Z0-9]+
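
As a quick sanity check in Python (a minimal sketch; the variable names are illustrative and not from the original post), the three separate lookaheads and the single combined one reject exactly the same positions, so both patterns return the same matches on the sample input:

```python
import re

# Sample input from the question.
text = """FOO1 = BAR2 AND Var1
Gene3 = Gene4 >= 3
Kinase = MATH OR NOT Science
BOOP = 3"""

# The asker's pattern: three lookaheads plus a capturing group.
original = re.findall(r"(?!AND)(?!OR)(?!NOT)([a-zA-Z0-9]+)", text)

# Steps 1 and 2: one lookahead with alternation, no capturing group.
simplified = re.findall(r"(?!AND|OR|NOT)[a-zA-Z0-9]+", text)

# With a single group, findall returns that group, which here spans the whole match.
print(original == simplified)  # True
```

Neither pattern is correct yet: both still return fragments such as ND, which is what step 3 addresses.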

Let's take foo AND bar as an example input:

foo AND bar
^ Checks if there is no "AND", "OR" or "NOT" literally
since there isn't, it will match foo with [a-zA-Z0-9]+

foo AND bar
   ^ no match

foo AND bar
    ^ Here it will fail because of the negative lookahead

foo AND bar
     ^ It will succeed because there is no "AND", "OR" or "NOT" literally

So the solution is to add a word boundary \b. In front of the character class it behaves like (?<!\w), which means the regex fails if there is a word character immediately behind the current position.

foo AND bar
     ^ fail, because there is a word character behind

foo AND bar
        ^^^ match
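
The walkthrough above can be reproduced in Python. This is a small sketch using re.findall; the variable names are illustrative only:

```python
import re

text = "foo AND bar"

# Without a word boundary the engine retries one character into "AND",
# where the lookahead no longer sees an operator, and matches "ND".
without_boundary = re.findall(r"(?!AND|OR|NOT)[a-zA-Z0-9]+", text)

# With \b in front of the character class, a match can only begin
# at the start of a word, so the attempt at "ND" fails.
with_boundary = re.findall(r"(?!AND|OR|NOT)\b[a-zA-Z0-9]+", text)

print(without_boundary)  # ['foo', 'ND', 'bar']
print(with_boundary)     # ['foo', 'bar']
```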

Upvotes: 2

Sam

Reputation: 20486

You will want to use a word boundary (\b), which is useful for finding the start or end of a word. It is a zero-length assertion (it doesn't consume any characters, much like the anchors ^ and $) that holds at the positions described by (^\w|\w\W|\W\w|\w$). In other words, it makes sure there is a word character (\w, which is [a-zA-Z0-9_] for ASCII text) next to a non-word character or the beginning/end of the string. You can also combine your lookaheads into one (and the capture group is most likely unnecessary):

\b(?!AND|OR|NOT)[a-zA-Z0-9]+

Note that a word boundary is not needed at the end of the expression, since the + quantifier is greedy and will grab as much of [a-zA-Z0-9]+ as possible.
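
Here is a short Python sketch of both points: where \b holds, and what the pattern returns on the sample text from the question. The variable names are illustrative, and the final filter assumes that bare numbers such as 3 are not wanted, which the question does not state explicitly:

```python
import re

# \b is zero-width: it matches positions, not characters.
# In "foo AND" it holds at the start and end of each word.
boundaries = [m.start() for m in re.finditer(r"\b", "foo AND")]
print(boundaries)  # [0, 3, 4, 7]

# Sample input from the question.
text = """FOO1 = BAR2 AND Var1
Gene3 = Gene4 >= 3
Kinase = MATH OR NOT Science
BOOP = 3"""

matches = re.findall(r"\b(?!AND|OR|NOT)[a-zA-Z0-9]+", text)
print(matches)
# ['FOO1', 'BAR2', 'Var1', 'Gene3', 'Gene4', '3', 'Kinase', 'MATH', 'Science', 'BOOP', '3']

# Assumption: purely numeric tokens are literals, not variable names.
names = [m for m in matches if not m.isdigit()]
print(names)
# ['FOO1', 'BAR2', 'Var1', 'Gene3', 'Gene4', 'Kinase', 'MATH', 'Science', 'BOOP']
```

The digits appear in the first result because the character class [a-zA-Z0-9] accepts them; the pattern itself only excludes the three operators.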


If your variables can have underscores (_) in them, it may be cleaner to use the \w shorthand character class (which, as mentioned above, is the same as [a-zA-Z0-9_]). The final expression would be:

\b(?!AND|OR|NOT)\w+
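
One caveat worth checking: the lookahead (?!AND|OR|NOT) only inspects the first characters of a word, so a name that merely begins with an operator is rejected too. A possible refinement, which is not part of the expression above, is to put a \b inside the lookahead so that only the complete words AND, OR and NOT are excluded. The sketch below uses made-up names (ORANGE, NOT_READY, NOTCH, gene_1) to show the difference. It also passes re.ASCII, because in Python 3 \w otherwise matches Unicode word characters as well, not just [a-zA-Z0-9_]:

```python
import re

# Hypothetical input with names that start with an operator.
text = "ORANGE = NOT_READY AND gene_1 OR NOTCH"

# Expression as posted: any word starting with AND, OR or NOT is skipped.
as_posted = re.findall(r"\b(?!AND|OR|NOT)\w+", text, flags=re.ASCII)

# Refinement (an assumption, not from the answer): the inner \b makes the
# lookahead reject only the whole words AND, OR and NOT.
stricter = re.findall(r"\b(?!(?:AND|OR|NOT)\b)\w+", text, flags=re.ASCII)

print(as_posted)  # ['gene_1']
print(stricter)   # ['ORANGE', 'NOT_READY', 'gene_1', 'NOTCH']
```

If none of your variable names can start with AND, OR or NOT, the two expressions behave the same.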

Side note: (?!...) is a negative lookahead, not a lookbehind. It makes sure the characters in front of the engine's internal pointer do not match ...; the lookbehind counterpart is (?<!...).
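
To make the difference concrete, here is a minimal Python sketch on a made-up string. The lookahead inspects what follows the current position, while the lookbehind (?<!...) inspects what precedes it:

```python
import re

text = "foobar foobaz bazbar"

# Negative lookahead: "foo" only where it is NOT followed by "bar".
lookahead_hits = [m.start() for m in re.finditer(r"foo(?!bar)", text)]

# Negative lookbehind: "bar" only where it is NOT preceded by "foo".
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!foo)bar", text)]

print(lookahead_hits)   # [7]  -> the "foo" in "foobaz"
print(lookbehind_hits)  # [17] -> the "bar" in "bazbar"
```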

Upvotes: 1
