codyc4321

Reputation: 9672

Python regex ignoring underscore incorrectly

I am trying to grab filenames from a list of endings that looks like this:

final count: {'.pem': 5027, '__base__': 434, '.rb': 62341, '/AUTHORS': 1358, '.sty': 859, '.gitignore': 193,...}

My regex looks as follows:

p = re.compile(r"'([\W]+)(.*?)'")

It works OK except on '__base__', where I get '__base__' instead of the 'base' I want, because the underscore counts as a word character and so \W does not match it. I tried:

p = re.compile(r"'([\W]+|\_+)(.*?)'")
p = re.compile(r"'([\W]+|_+)(.*?)'")    

and

p = re.compile(r"'([\W]+)|(_+)(.*?)'")

but none worked. What is the proper way to do this? Thank you

Upvotes: 0

Views: 463

Answers (3)

James Lemieux

Reputation: 760

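Try adding the underscore to the character class in your regex:

p = re.compile(r"'([\W^_]+)(.*?)'")

When ^ is outside a character class (the square brackets), it anchors the match to the beginning of the string, or to the beginning of each line with re.MULTILINE. Inside a character class it means "not" only when it is the first character, as in [^_]. Anywhere else in the class, as here, it is a literal caret, which \W already covers, so it is the _ that makes the leading underscores match and [\W_] would behave the same.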
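Here is a small sketch of how the position of ^ inside the square brackets changes what it does. The test string is just the '__base__' key from the question, and the variable names are only for illustration.

```python
import re

text = "__base__"

# ^ as the first character of the class negates it:
# this matches runs of anything that is not an underscore.
negated = re.findall(r"[^_]+", text)

# ^ later in the class is a literal caret; the _ is what matches underscores.
with_caret = re.findall(r"[\W^_]+", text)
without_caret = re.findall(r"[\W_]+", text)

print(negated)        # ['base']
print(with_caret)     # ['__', '__']
print(without_caret)  # ['__', '__']
```

Both of the last two classes pick up the leading and trailing underscores, so the caret does not change the result here.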

Upvotes: 2

user

Reputation: 5696

You can use this:

re.findall(r"([a-zA-Z0-9]+)_{0,2}':", my_str)

It captures a run of consecutive letters and digits that is followed by zero to two underscores and then ':, since the name you need always sits right before ':.

Explanation:
{0,2} matches 0 to 2 repetitions of the preceding token, here the underscore.
[a-zA-Z0-9]+ is used instead of \w+ since the latter would match _ as well.
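As a quick check, here is the pattern run against the entries shown in the question. I am assuming my_str is the printed dict as a string, and I dropped the trailing "..." so the sample is complete.

```python
import re

# Assumption: my_str holds the printed dict from the question (shortened).
my_str = (
    "{'.pem': 5027, '__base__': 434, '.rb': 62341, "
    "'/AUTHORS': 1358, '.sty': 859, '.gitignore': 193}"
)

# Letters/digits, then up to two underscores, then the closing quote and colon.
names = re.findall(r"([a-zA-Z0-9]+)_{0,2}':", my_str)

print(names)  # ['pem', 'base', 'rb', 'AUTHORS', 'sty', 'gitignore']
```

The counts such as 5027 are skipped because they are never followed by ':.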

Upvotes: 1

vks

Reputation: 67968

p = re.compile(r"'([^a-zA-Z0-9]+)(.*?)'")

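You can simply use this. [^a-zA-Z0-9] matches anything that is not a letter or a digit, so it covers underscores as well as punctuation.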
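A minimal sketch of what this pattern returns on a shortened sample from the question. Note that the lazy second group still keeps the trailing underscores of '__base__', so the rstrip('_') step is my own addition, assuming you want those dropped as well.

```python
import re

# Shortened sample modeled on the dict in the question.
sample = "{'.pem': 5027, '__base__': 434, '.rb': 62341}"

p = re.compile(r"'([^a-zA-Z0-9]+)(.*?)'")

# Group 2 is everything after the leading non-alphanumeric prefix.
raw = [m.group(2) for m in p.finditer(sample)]

# Assumption: trailing underscores are unwanted too, so strip them afterwards.
cleaned = [name.rstrip("_") for name in raw]

print(raw)      # ['pem', 'base__', 'rb']
print(cleaned)  # ['pem', 'base', 'rb']
```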

Upvotes: 1
