diegor

Reputation: 75

Match a double-quoted string with double quotes inside

I have this python string:

string = '"/dev/null" "" "19/1333329478.9381399" 0 1 "cam-foo" 64 900.0 "Foo x rev scan of test" "/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py" 60.145855 2.034689'

I need a regex that returns a list of every element in this string. An element is either a number or a string enclosed in double quotes. A quoted string can itself contain double quotes.

I've come up with this regex:

import re    
p = re.compile(r'"[^"]*"|[-\.\d]+')
p.findall(string)
['"/dev/null"', '""', '"19/1333329478.9381399"', '0', '1', '"cam-foo"', '64', '900.0', '"Foo x rev scan of test"', '"/usr/bin/env "', '"PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"', '" python app.py"', '60.145855', '2.034689']

As you can see, the element that contains double quotes gets split apart. Double quotes inside an element should be ignored. I'd like to get this result:

['"/dev/null"', '""', '"19/1333329478.9381399"', '0', '1', '"cam-foo"', '64', '900.0', '"Foo x rev scan of test"', '"/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py"', '60.145855', '2.034689']

Instead of having three (or more) elements:

[..., '"/usr/bin/env "', '"PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"', '" python app.py"', ...]

I'd like to have only one element:

'"/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py"'

Can anyone help me?

Upvotes: 2

Views: 4211

Answers (4)

jfs

Reputation: 414285

You could use the csv module.

Example

>>> import csv
>>> from pprint import pprint
>>> pprint(list(csv.reader([string], delimiter=' ', quotechar='"')))
[['/dev/null',
  '',
  '19/1333329478.9381399',
  '0',
  '1',
  'cam-foo',
  '64',
  '900.0',
  'Foo x rev scan of test',
  '/usr/bin/env "PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH" python app.py',
  '60.145855',
  '2.034689']]
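As a self-contained sketch of the same idea, the snippet below wraps the csv.reader call in a small helper. The name split_fields and the shortened sample line are made up for illustration. It relies on csv's doublequote option (on by default), which reads two consecutive quote characters inside a quoted field as one literal quote.

```python
import csv

def split_fields(line):
    """Split a space-delimited line; "" inside a quoted field is a literal quote."""
    # csv.reader accepts any iterable of strings, so a one-item list yields one row.
    reader = csv.reader([line], delimiter=' ', quotechar='"', doublequote=True)
    return next(reader)

fields = split_fields('"/dev/null" "" 0 "say ""hi"" now" 2.5')
print(fields)  # ['/dev/null', '', '0', 'say "hi" now', '2.5']
```

Note that every field comes back as a string with the outer quotes removed, so numbers would still need converting with int() or float().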

Upvotes: 3

jathanism

Reputation: 33716

If all you need is to be able to split this exact case, you can use shlex.split():

>>> import shlex
>>> s = '"/dev/null" "" "19/1333329478.9381399" 0 1 "cam-foo" 64 900.0 "Foo x rev scan of test" "/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py" 60.145855 2.034689'
>>> shlex.split(s)
['/dev/null', '', '19/1333329478.9381399', '0', '1', 'cam-foo', '64', '900.0', 'Foo x rev scan of test', '/usr/bin/env PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH python app.py', '60.145855', '2.034689']
>>> shlex.split(s)[-3]
'/usr/bin/env PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH python app.py'

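It's not a regex, but it splits this input into the right elements. Note that shlex follows shell quoting rules, so the doubled quotes inside the element are dropped from the result rather than kept, as the output above shows.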
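To see how the two non-regex approaches treat doubled quotes, here is a small comparison on a made-up sample line. It is only a sketch: shlex reads the doubled quotes as adjacent quoted segments and joins them, while csv (with its default doublequote behaviour) reads them as one literal quote.

```python
import csv
import shlex

line = '"say ""hi"" now" 2.5'

# Shell rules: "say " + "hi" + " now" are concatenated, and the quotes vanish.
shell_style = shlex.split(line)

# CSV rules: "" inside a quoted field stands for a single literal quote.
csv_style = next(csv.reader([line], delimiter=' ', quotechar='"'))

print(shell_style)  # ['say hi now', '2.5']
print(csv_style)    # ['say "hi" now', '2.5']
```

Which one is appropriate depends on whether the inner quotes carry meaning in your data.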

Upvotes: 1

The first half of your regular expression currently matches a pair of double quotes surrounding zero or more non-double-quote characters.

r'"[^"]*"'

You can achieve your desired result by changing which strings you match inside the surrounding double quotes.

r'"(?:[^"]|"")*"'

This regular expression matches a pair of double quotes that surround zero or more strings; each string must consist of either one non-double-quote character or two consecutive double quotes. (The ?: marks the parenthesized part as a non-capturing group; with a capturing group, re.findall would return only the text inside the parentheses.)

Let's plug that into your complete regex:

% python
Python 2.7.2 (default, Mar 20 2012, 13:27:18) 
[GCC 4.2.1 Compatible Apple Clang 3.1 (tags/Apple/clang-318.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s = '"/dev/null" "" "19/1333329478.9381399" 0 1 "cam-foo" 64 900.0 "Foo x rev scan of test" "/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py" 60.145855 2.034689'
>>> for el in re.findall(r'"(?:[^"]|"")*"|[-\.\d]+', s): print(el)
... 
"/dev/null"
""
"19/1333329478.9381399"
0
1
"cam-foo"
64
900.0
"Foo x rev scan of test"
"/usr/bin/env ""PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:$PATH"" python app.py"
60.145855
2.034689
>>>
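If you also want the values rather than the raw tokens, one possible follow-up step is sketched below. The helper name parse_tokens and the shortened sample are illustrative, and the code assumes every bare token is a well-formed number (a lone "-" or "." would also match [-.\d]+ and make the conversion fail).

```python
import re

TOKEN_RE = re.compile(r'"(?:[^"]|"")*"|[-.\d]+')

def parse_tokens(text):
    """Return quoted tokens as unescaped str and bare tokens as int or float."""
    values = []
    for token in TOKEN_RE.findall(text):
        if token.startswith('"'):
            # Drop the surrounding quotes, then collapse "" into a single ".
            values.append(token[1:-1].replace('""', '"'))
        elif '.' in token:
            values.append(float(token))
        else:
            values.append(int(token))
    return values

result = parse_tokens('"cam-foo" 64 900.0 "env ""PATH=/bin"" python"')
print(result)  # ['cam-foo', 64, 900.0, 'env "PATH=/bin" python']
```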

Upvotes: 3

darnir

Reputation: 5180

Enclose the part of the regex you want to extract in (). re.findall will then return the captured groups for each match instead of the whole match, so you can pick out the part you need. E.g.:

m = p.findall(string)

This returns a list in m. With one group in the pattern, each element is the text that group captured; with several groups, each element is a tuple holding one entry per group. This way you can retrieve the exact part of the match that you want.
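A quick way to check how capturing groups change what re.findall returns is to run the same pattern with zero, one, and two groups. The sample string here is invented for illustration, and the pattern is borrowed from the question's context rather than from this answer.

```python
import re

s = '"a ""b"" c" 42'

# No capturing groups: each item is the whole match.
whole = re.findall(r'"(?:[^"]|"")*"|[-.\d]+', s)

# One capturing group: each item is only that group's text, and an
# empty string when the group took no part in the match.
one_group = re.findall(r'"((?:[^"]|"")*)"|[-.\d]+', s)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'"((?:[^"]|"")*)"|([-.\d]+)', s)

print(whole)       # ['"a ""b"" c"', '42']
print(one_group)   # ['a ""b"" c', '']
print(two_groups)  # [('a ""b"" c', ''), ('', '42')]
```

With a single group the number 42 is lost, which is why the non-capturing (?:...) form or one group per alternative is usually the safer choice here.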

Upvotes: 0
