Cemre Mengü

Reputation: 18754

Python findall does not return expected values

I have some strings that contain info between double quotes, like:

cc "1/11/2A" "1/20+21/1 1" "XX" 0

I am using re.findall('\"*\"', line) to match the parts between quotes, but it doesn't work for some reason. I tried many other things, but all I get is an empty list. What am I doing wrong?

Upvotes: 0

Views: 235

Answers (3)

abarnert

Reputation: 365707

It looks like you were expecting * to match "anything", the way it does in filename wildcards.

But that's not what it means in regex. It modifies the preceding expression, to match zero or more copies of that expression.
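To see what that means for the pattern in the question, here is a minimal sketch. It assumes line holds the sample string from the question; the name quote_runs is only illustrative.

```python
import re

# Sample line from the question.
line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# "*" means: zero or more quote characters, followed by one quote character.
# Every quote in the sample stands alone, so each match is a single quote.
quote_runs = re.findall(r'"*"', line)
print(quote_runs)  # ['"', '"', '"', '"', '"', '"']
```

So the pattern only ever matches the quote characters themselves, never the text between them.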

To get a filename-style wildcard, you want to use .* instead.

However, that won't actually work, because . matches any character (other than a newline), including ". So .* will grab everything up to the very last " character, leaving only that quote for the rest of the expression, meaning findall will find one big string instead of three small ones.
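A small sketch of that greedy behaviour, again assuming line is the sample string from the question (greedy_matches is just an illustrative name):

```python
import re

# Sample line from the question.
line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# Greedy .* runs to the last quote on the line, so there is a single match
# spanning all three quoted fields.
greedy_matches = re.findall(r'".*"', line)
print(greedy_matches)  # ['"1/11/2A" "1/20+21/1 1" "XX"']
```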

You can fix that by making the repetition non-greedy, with .*?. This matches as few characters as possible, so each match stops at the next " instead of running on to the last one.

So:

>>> re.findall('\".*?\"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

I think Martijn Pieters's answer is probably conceptually clearer; I've only offered this because I think this may be the way you were trying to attack the problem, and I wanted to show how you could have gotten there.

As a side note, regex code is much easier to read if you use raw strings, so you can get rid of the excess backslash escapes. In this case, the backslashes are already unnecessary: you don't need to escape double quotes in either a single-quoted string or a regex. But instead of trying to remember what does and doesn't need to be escaped by the Python parser so it can get to the regex parser, it's easier to just use raw strings. So:

>>> re.findall(r'".*?"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
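To illustrate why raw strings are the safer habit, here is a short sketch. The \b example is not from the question; it is added only to show a case where the escaping does make a difference.

```python
import re

# In a single-quoted literal, \" is just a double quote, so these two
# spellings produce exactly the same pattern string.
escaped = '\".*?\"'
raw = r'".*?"'
print(escaped == raw)  # True

# With other escapes the difference is real: '\b' is a backspace character
# to the Python parser, while r'\b' reaches the regex engine as a
# word-boundary assertion.
plain_b = '\b'
raw_b = r'\b'
print(len(plain_b), len(raw_b))  # 1 2

words = re.findall(r'\bXX\b', 'cc "XX" 0')
print(words)  # ['XX']
```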

Upvotes: 2

Martijn Pieters

Reputation: 1121834

You are matching 0 or more quotes followed by a quote. Use a negative character class instead:

re.findall(r'"[^"]*"', line)

You may want to put a capturing group around the negative character class:

re.findall(r'"([^"]*)"', line)

and now .findall() returns everything within quotes, not including the quotes themselves:

>>> import re
>>> re.findall(r'"([^"]*)"', 'cc "1/11/2A" "1/20+21/1 1" "XX" 0')
['1/11/2A', '1/20+21/1 1', 'XX']

The [^...] negative character class notation means: match any character that is not included in the set of characters named here. [^"] thus matches any character that is not a quote, neatly limiting the matched characters to everything that is within quotes.
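If the pattern is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: the names QUOTED and quoted_fields are hypothetical and not part of the answer, and the pattern assumes quotes are never escaped inside a field.

```python
import re

# Hypothetical names; the pattern is the capturing version shown above.
QUOTED = re.compile(r'"([^"]*)"')

def quoted_fields(text):
    """Return the text inside each pair of double quotes, without the quotes."""
    return QUOTED.findall(text)

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'
fields = quoted_fields(line)
print(fields)  # ['1/11/2A', '1/20+21/1 1', 'XX']

# Because * allows zero repetitions, an empty pair of quotes yields ''.
print(quoted_fields('a "" b'))  # ['']

# A line without any quotes yields an empty list.
print(quoted_fields('cc 0'))  # []
```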

Upvotes: 4

Lev Levitsky

Reputation: 65791

It should be r'"[^"]*"'. Your pattern matches one or more " characters in a row.

In [4]: re.findall(r'"[^"]*"', line)
Out[4]: ['"1/11/2A"', '"1/20+21/1 1"', '"XX"']

Upvotes: 2
