Reputation: 3
I'm using Python's re module (I know regular expressions are more general than that, but there may be other suitable packages) to read data containing inequalities, in which each variable name is followed by +, -, >, < or =. (It's a system of inequalities.) I need to extract the variable names.
Up until now, I used
var_pattern = re.compile(r'[a-z|A-Z]+\d*\.?')
which is somewhat 'hacky' as it isn't very general. I didn't mind until I ran into a problem with unusual names like the ones below.
My next go was
var_pattern = re.compile(r'[a-z|A-Z]+[a-zA-Z0-9_.]*')
which should, after at least one initial letter, match just about everything that occurs except for +, -, >, < and =. This works nicely with variable names like 'x23' or 'C2000001.' but not with 'x_w_3_dummy_1'. I would have thought the underscore was to blame, but the pattern seems to work just fine with the variable 'x_b_1_0_0'.
Does anybody have an idea of what might cause this and, more importantly, how to fix it?
As an aside, I also tried
var_pattern = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')
but to no avail either.
Upvotes: 0
Views: 2171
Reputation: 619
Your question has already been answered, apart from why your original expression seemed not to work with your underscores. If you have the pattern
r'[a-zA-Z][a-zA-Z0-9_.]*'
then the dot inside the brackets is a literal dot, not a wildcard, so the pattern is not equivalent to
r'[a-zA-Z].*'
Contrary to what you thought, it does match both your "x_w_3_dummy_1" and your "x_b_1_0_0", and it stops before a delimiter such as +, -, >, < or =. The only side effect of the dot is that a literal "." directly after a name, as in 'C2000001.', becomes part of the match.
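A quick way to check how the dot behaves inside a character class is to run both patterns on the same input. The sketch below uses a made-up sample string `line` (an assumption, not taken from the original data); in Python's re module a dot inside brackets matches only a literal dot, so the two patterns need not give the same result.

```python
import re

# Hypothetical sample line, not taken from the original data.
line = 'x_w_3_dummy_1+15'

# Dot inside a character class: matches only a literal '.'.
in_class = re.compile(r'[a-zA-Z][a-zA-Z0-9_.]*')
# Dot outside a character class: matches any character except a newline.
wildcard = re.compile(r'[a-zA-Z].*')

class_match = in_class.search(line).group()
wildcard_match = wildcard.search(line).group()

print(class_match)     # x_w_3_dummy_1
print(wildcard_match)  # x_w_3_dummy_1+15
```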
Upvotes: 0
Reputation: 1122062
Your pattern should work just fine for your examples, but here is a slightly corrected version that actually matches your intention:
r'[a-zA-Z][a-zA-Z0-9_]*'
This matches 1 initial letter (lower or uppercase), followed by 0 or more letters, digits and underscores. Your version had a redundant +, and included | in what was allowed for the first character, and . for the rest of the name.
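To illustrate the point about |, here is a small sketch comparing the first-character part of the two patterns. Inside brackets, | is an ordinary character rather than alternation, so the older class accepts a literal pipe. The names `old_first` and `new_first` and the input '|x' are made up for this example.

```python
import re

# '|' inside brackets is a literal character, not alternation.
old_first = re.compile(r'[a-z|A-Z]+')
new_first = re.compile(r'[a-zA-Z]')

pipe_old = old_first.match('|x')  # matches, because '|' is in the class
pipe_new = new_first.match('|x')  # None, because '|' is not a letter

print(pipe_old.group())  # |x
print(pipe_new)          # None
```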
A demonstration to show this matches all your samples:
>>> import re
>>> names = ('x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0')
>>> var_pattern = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')
>>> for name in names:
...     print(var_pattern.search(name).group())
...
x23
C2000001
x_w_3_dummy_1
x_b_1_0_0
The pattern does not match any +, -, >, < or = characters that might follow the variable name:
>>> var_pattern.findall('x23<10\nC2000001=24\nx_w_3_dummy_1+15\nx_b_1_0_0-5')
['x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0']
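Building on that, a minimal sketch of collecting the distinct variable names from a whole system might look like the following. The helper `variable_names` and the sample `system` string are hypothetical; the real input format may differ.

```python
import re

var_pattern = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')

def variable_names(text):
    """Return the distinct variable names in text, in order of first appearance."""
    seen = []
    for name in var_pattern.findall(text):
        if name not in seen:
            seen.append(name)
    return seen

# Hypothetical system of inequalities; the format is an assumption.
system = 'x23 + x_b_1_0_0 <= 10\nx23 - x_w_3_dummy_1 >= 0'
names = variable_names(system)
print(names)  # ['x23', 'x_b_1_0_0', 'x_w_3_dummy_1']
```

One caveat: if the data contains numbers in scientific notation such as 1e5, this pattern would pick up 'e5' as a name, so such input would need extra handling.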
Upvotes: 2