rubek

Reputation: 3

Regular Expression Matching With Underscores

I'm using Python's re package (yes, I am aware that regular expressions are more general, but who knows, there may be other packages) to read some data which includes inequalities, where each variable name is followed by +, -, >, < or =. (It's a system of inequalities.) I need to extract the variable names.

Up until now, I used

var_pattern = re.compile(r'[a-z|A-Z]+\d*\.?')

which is somewhat hacky, as it isn't very general. I didn't mind, but then ran into a problem with unusual names like the ones below.

My next attempt was

var_pattern = re.compile(r'[a-z|A-Z]+[a-zA-Z0-9_.]*')

which should, after at least one initial letter, match just about everything that occurs except for +, -, >, < and =. This works nicely with variable names like 'x23' or 'C2000001.' but not with 'x_w_3_dummy_1'. I would have thought the underscore was to blame, but the pattern seems to work just fine with the variable 'x_b_1_0_0'.

Does anybody have an idea of what might cause this and, more importantly, how to fix it?

As an aside, I also tried

var_pattern = re.compile(r'[a-z|A-Z]+[^+^-^>^<^=]*')

but to no avail either.

Upvotes: 0

Views: 2171

Answers (3)

Penfold

Reputation: 619

Your question has already been answered, apart from why your original expression didn't work with your underscores. If you have the pattern

r'[a-zA-Z][a-zA-Z0-9_.]*'

then the dot inside the square brackets is a literal period, not the "any character" wildcard, so the pattern is not equivalent to

r'[a-zA-Z].*'

So, contrary to what you thought, this does match both your "x_w_3_dummy_1" and your "x_b_1_0_0", and it stops before the following delimiter (+, -, >, < or =). The only effect of the dot is that a period directly after a name, as in 'C2000001.', becomes part of the match.
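
A quick way to check how the dot behaves inside and outside square brackets is to run both patterns on the same line. This is only a sketch; the sample line is made up for illustration and is not taken from the question's data.

```python
import re

line = 'x_w_3_dummy_1+15'  # hypothetical sample line, not from the question

# Inside [...] the dot is a literal period.
in_class = re.compile(r'[a-zA-Z][a-zA-Z0-9_.]*')
# Outside [...] the dot matches any character except a newline.
wildcard = re.compile(r'[a-zA-Z].*')

in_class_match = in_class.search(line).group()
wildcard_match = wildcard.search(line).group()

print(in_class_match)  # x_w_3_dummy_1
print(wildcard_match)  # x_w_3_dummy_1+15
```

Only the second pattern runs past the + sign, because only there does the dot act as a wildcard.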

Upvotes: 0

Martijn Pieters

Reputation: 1122062

Your pattern should work just fine for your examples, but here it is corrected slightly to match what you actually intend:

r'[a-zA-Z][a-zA-Z0-9_]*'

This matches 1 initial letter (lowercase or uppercase), followed by 0 or more letters, digits and underscores. Your version had a redundant +, included | in what was allowed for the first character, and allowed . in the rest of the name.

A demonstration to show this matches all your samples:

>>> import re
>>> names = ('x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0')
>>> var_pattern = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')
>>> for name in names:
...     print(var_pattern.search(name).group())
... 
x23
C2000001
x_w_3_dummy_1
x_b_1_0_0

The pattern does not match any +, -, >, < or = characters that might follow the variable name:

>>> var_pattern.findall('x23<10\nC2000001=24\nx_w_3_dummy_1+15\nx_b_1_0_0-5')
['x23', 'C2000001', 'x_w_3_dummy_1', 'x_b_1_0_0']
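
To see what the stray | and . in the original pattern actually change, here is a small side-by-side comparison. It is only a sketch: the input string is made up for illustration, and the variable names `original` and `corrected` are my own labels.

```python
import re

sample = 'a|b C2000001.'  # made-up input, not from the question

# Inside [...] both | and . are literal characters.
original = re.compile(r'[a-z|A-Z]+[a-zA-Z0-9_.]*')
corrected = re.compile(r'[a-zA-Z][a-zA-Z0-9_]*')

original_hits = original.findall(sample)
corrected_hits = corrected.findall(sample)

print(original_hits)   # ['a|b', 'C2000001.']
print(corrected_hits)  # ['a', 'b', 'C2000001']
```

If a trailing period really is part of your names, as in 'C2000001.', you would want to keep the dot in the second character class.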

Upvotes: 2

Mikhail Vladimirov

Reputation: 13890

Should be:

[a-zA-Z_][a-zA-Z0-9_.]*
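
The answer gives no explanation, so as a rough sketch of what this variant accepts: it allows a leading underscore and keeps the dot as a name character. The input string below is invented for illustration.

```python
import re

# Leading underscore allowed; dot kept as a literal name character.
var_pattern = re.compile(r'[a-zA-Z_][a-zA-Z0-9_.]*')

sample = '_tmp+x.y<3'  # invented input, not from the question
found = var_pattern.findall(sample)

print(found)  # ['_tmp', 'x.y']
```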

Upvotes: 0
