Reputation: 13
I have been learning regex (and using it from Python) for the past couple of days and haven't figured out how to solve this problem.
I have text in this format:
FOO1 = BAR2 AND Var1
Gene3 = Gene4 >= 3
Kinase = MATH OR NOT Science
BOOP = 3
I would like to identify each variable name (e.g. FOO1, BAR2, BOOP) and ignore any of the logical operators (e.g. AND, OR, NOT).
Here is my attempt at a solution: (?!AND)(?!OR)(?!NOT)([a-zA-Z0-9]+)
I am having trouble telling the look-behinds to recognize AND, OR, NOT as words rather than a set of individual characters.
Any help would be appreciated. Thanks in advance!
Upvotes: 1
Views: 1172
Reputation: 14921
First of all, thanks for showing your attempts. Second, let's try to improve your regex in several ways:
You've got some nice lookaheads which could be simplified to: (?!AND|OR|NOT)([a-zA-Z0-9]+)
We don't really need a capturing group: (?!AND|OR|NOT)[a-zA-Z0-9]+
Let's add a word boundary to prevent partial matching: (?!AND|OR|NOT)\b[a-zA-Z0-9]+
To see why the word boundary is needed, let's take foo AND bar as input and walk through the regex without \b:
foo AND bar
^ Checks that "AND", "OR" or "NOT" does not appear literally at this position
since none does, it will match foo with [a-zA-Z0-9]+
foo AND bar
   ^ no match
foo AND bar
    ^ Here it will fail because of the negative lookahead
foo AND bar
     ^ It will succeed because there is no "AND", "OR" or "NOT" literally, so "ND" gets matched
So the solution is to add a word boundary \b. In front of [a-zA-Z0-9]+ it behaves the same as (?<!\w), which means the regex fails at any position that has a word character behind it.
foo AND bar
     ^ fail, because there is a word character behind
foo AND bar
        ^^^ match
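The walkthrough above can be reproduced from Python. This is a minimal sketch using the standard `re` module; the variable names are only illustrative.

```python
import re

text = "foo AND bar"

# Without the word boundary, the scan restarts in the middle of "AND"
# and picks up the leftover "ND".
without_boundary = re.findall(r"(?!AND|OR|NOT)[a-zA-Z0-9]+", text)

# With \b, a match may only begin where a word starts.
with_boundary = re.findall(r"(?!AND|OR|NOT)\b[a-zA-Z0-9]+", text)

print(without_boundary)  # ['foo', 'ND', 'bar']
print(with_boundary)     # ['foo', 'bar']
```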
Upvotes: 2
Reputation: 20486
You will want to use a word boundary (\b). This is useful for finding the start or end of a word. It is a zero-length assertion (so it doesn't actually consume anything, much like the anchors ^ and $) that holds at (^\w|\w\W|\W\w|\w$). In other words, it makes sure there is a word character (\w === [a-zA-Z0-9_]) next to a non-word character or the beginning/end of the string. You can also combine your lookaheads into one (and the capture group is most likely unnecessary):
\b(?!AND|OR|NOT)[a-zA-Z0-9]+
Note that a word boundary is not needed at the end of the expression, since the quantifier is greedy and will grab as much of [a-zA-Z0-9]+ as possible.
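To make the zero-length behaviour concrete, here is a small sketch that lists where \b holds in a sample string and checks that a trailing \b changes nothing for that input. The sample string and variable names are assumptions chosen for illustration.

```python
import re

text = "foo AND bar"

# Every \b match is empty; only its position is interesting.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 3, 4, 7, 8, 11]

# The greedy + already runs to the end of each word, so adding a
# trailing \b gives the same result on this input.
leading_only = re.findall(r"\b(?!AND|OR|NOT)[a-zA-Z0-9]+", text)
both_ends = re.findall(r"\b(?!AND|OR|NOT)[a-zA-Z0-9]+\b", text)
print(leading_only == both_ends)  # True
```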
If your variables can have underscores (_) in them, it may be cleaner to use the \w shorthand character class (which, as mentioned above, is the same as [a-zA-Z0-9_]). The final expression would be:
\b(?!AND|OR|NOT)\w+
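As a rough sketch, here is the final expression applied to the sample text from the question. The second pattern is not part of the answer: it is an assumed variant that puts a \b inside the lookahead so that hypothetical names which merely start with an operator (such as ORDER or NOTE) are still matched.

```python
import re

sample = """FOO1 = BAR2 AND Var1
Gene3 = Gene4 >= 3
Kinase = MATH OR NOT Science
BOOP = 3"""

# The expression from the answer. Note that bare numbers such as 3
# also count as \w+ and therefore show up in the result.
tokens = re.findall(r"\b(?!AND|OR|NOT)\w+", sample)
print(tokens)
# ['FOO1', 'BAR2', 'Var1', 'Gene3', 'Gene4', '3',
#  'Kinase', 'MATH', 'Science', 'BOOP', '3']

# Assumed variant: only reject AND/OR/NOT when they are whole words.
edge_case = "ORDER OR NOTE"
original = re.findall(r"\b(?!AND|OR|NOT)\w+", edge_case)
whole_word = re.findall(r"\b(?!(?:AND|OR|NOT)\b)\w+", edge_case)
print(original)    # []
print(whole_word)  # ['ORDER', 'NOTE']
```

Whether the stricter variant is needed depends on whether variable names in the real data can begin with one of the operator words.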
Side note: (?!...) is a negative lookahead, not a lookbehind (it makes sure the characters in front of the engine's internal pointer do not match ...).
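The difference between the two directions can be seen in a short sketch. The lookbehind pattern here is only an assumed example for contrast, not something proposed in the thread.

```python
import re

text = "NOT Science"

# Negative lookahead: inspects the text that starts at the current
# position, so the word "NOT" itself is rejected.
ahead = re.findall(r"\b(?!NOT)\w+", text)

# Negative lookbehind: inspects the text just before the current
# position, so the word that follows "NOT " is rejected instead.
behind = re.findall(r"\b(?<!NOT )\w+", text)

print(ahead)   # ['Science']
print(behind)  # ['NOT']
```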
Upvotes: 1