Reputation: 18754
I have some strings that contain info between double quotes, like:
cc "1/11/2A" "1/20+21/1 1" "XX" 0
I am using re.findall('\"*\"', line) to match the parts between quotes, but it doesn't work for some reason. I tried many other things, but all I get is an empty list. What am I doing wrong?
Upvotes: 0
Views: 235
Reputation: 365707
It looks like you were expecting * to match "anything", the way it does in filename wildcards.
But that's not what it means in regex. It modifies the preceding expression, to match zero or more copies of that expression.
To get a filename-style wildcard, you want to use .* instead.
However, that won't actually work, because . matches anything (other than a newline), including ". So .* will grab everything up to the very last " character, leaving only that for the rest of the expression, meaning findall will find one big string instead of three small ones.
You can fix that by making the repetition non-greedy: .*? matches as little as possible, so it stops at the next " instead of running on to the last one.
So:
>>> re.findall('\".*?\"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
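To see the difference side by side, here is a small self-contained sketch that runs both the greedy and the non-greedy pattern against the sample line from the question. The variable names greedy and lazy are just illustrative.

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# Greedy: .* runs on to the last quote on the line, so there is one big match.
greedy = re.findall(r'".*"', line)

# Non-greedy: .*? stops at the next quote, so each quoted field matches on its own.
lazy = re.findall(r'".*?"', line)

print(greedy)  # ['"1/11/2A" "1/20+21/1 1" "XX"']
print(lazy)    # ['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
```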
I think Martijn Pieters's answer is probably conceptually clearer; I've only offered this because I think this may be the way you were trying to attack the problem, and I wanted to show how you could have gotten there.
As a side note, regex code is much easier to read if you use raw strings, so you can get rid of the excess backslash escapes. In this case, the backslashes are already unnecessary—you don't need to escape double-quotes in either a single-quoted string or a regex. But instead of trying to remember what does and doesn't need to be escaped by the Python parser so it can get to the regex parser, it's easier to just use raw strings. So:
>>> re.findall(r'".*?"', line)
['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
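As a rough illustration of why raw strings are the safer habit: the two spellings of this particular pattern are the same string, but an escape such as \b (word boundary in a regex, backspace in an ordinary Python string) behaves differently. The \b example is my own addition, not something the question needed.

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# For this pattern the backslashes change nothing: both literals are the same string.
same = ('\".*?\"' == r'".*?"')

# With \b it matters: in a normal string it is a backspace character,
# in a raw string it reaches the regex engine as a word-boundary assertion.
without_raw = re.findall('\bXX\b', line)
with_raw = re.findall(r'\bXX\b', line)

print(same)         # True
print(without_raw)  # []
print(with_raw)     # ['XX']
```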
Upvotes: 2
Reputation: 1121834
You are matching 0 or more quotes followed by a quote. Use a negative character class instead:
re.findall(r'"[^"]*"', line)
You may want to put a capturing group around the negative character class:
re.findall(r'"([^"]*)"', line)
and now .findall() returns everything within the quotes, not including the quotes themselves:
>>> import re
>>> re.findall(r'"([^"]*)"', 'cc "1/11/2A" "1/20+21/1 1" "XX" 0')
['1/11/2A', '1/20+21/1 1', 'XX']
The [^...] negative character class notation means: match any character that is not included in the set of characters named here. [^"] thus matches any character that is not a quote, neatly limiting the matched characters to everything that is within quotes.
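A small sketch of two edge cases, using a made-up sample string (not from the question): an empty pair of quotes, and a quoted value that spans a newline. Because [^"] excludes only the quote character, it also matches newlines, whereas . does not unless re.DOTALL is given.

```python
import re

# Hypothetical input: one empty quoted field, one field containing a newline.
text = 'a "" b "x\ny" c'

# The negated class matches zero or more non-quote characters, newlines included.
with_class = re.findall(r'"([^"]*)"', text)

# A non-greedy dot cannot cross the newline, so the second field is missed.
with_dot = re.findall(r'"(.*?)"', text)

print(with_class)  # ['', 'x\ny']
print(with_dot)    # ['']
```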
Upvotes: 4
Reputation: 65791
It should be r'"[^"]*"'. Your pattern matches one or more " characters in a row.
In [4]: re.findall(r'"[^"]*"', line)
Out[4]: ['"1/11/2A"', '"1/20+21/1 1"', '"XX"']
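For comparison, this sketch shows what the pattern from the question matches on the same sample line, and on a second made-up string with two adjacent quotes: only runs of quote characters, never the text between them.

```python
import re

line = 'cc "1/11/2A" "1/20+21/1 1" "XX" 0'

# The pattern from the question: zero or more quotes followed by one quote.
original = re.findall('\"*\"', line)

# Adjacent quotes are consumed together as a single run (hypothetical input).
adjacent = re.findall('\"*\"', 'a "" b')

print(original)  # ['"', '"', '"', '"', '"', '"']
print(adjacent)  # ['""']
```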
Upvotes: 2