djmac

Reputation: 905

Split string, ignoring delimiter within quotation marks (python)

I would like to split a string on a comma, but ignore cases when it is within quotation marks:

for example:

teststring = '48, "one, two", "2011/11/03"'
teststring.split(",")
['48', ' "one', ' two"', ' "2011/11/03"']

and the output I would like is:

['48', ' "one, two"', ' "2011/11/03"']

Is this possible?

Upvotes: 17

Views: 22242

Answers (6)

Mikhail Zakharov

Reputation: 1099

tssplit is not part of the standard library, so you have to install it with pip, but it is another option:

In [5]: from tssplit import tssplit 
In [6]: tssplit('48, "one, two", "2011/11/03"', quote='"', delimiter=',', trim=' ')
Out[6]: ['48', 'one, two', '2011/11/03']

Upvotes: 7

StreetHawk

Reputation: 94

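import re
import shlex

teststring = '48, "one, two", "2011/11/03"'
# shlex.split splits on whitespace and honours the quotes,
# but leaves the trailing commas attached to the tokens
output = shlex.split(teststring)
# strip the trailing comma from each token
output = [re.sub(r",$", "", w) for w in output]
print(output)
['48', 'one, two', '2011/11/03']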

Upvotes: 1

David Webb

Reputation: 193814

You can use the csv module from the standard library:

>>> import csv
>>> testdata = ['48, "one, two", "2011/11/03"']
>>> testcsv = csv.reader(testdata,skipinitialspace=True)
>>> next(testcsv)
['48', 'one, two', '2011/11/03']

The one thing to watch out for is that csv.reader expects an iterable that yields one line of text on each iteration. This means that you can't pass a bare string straight to reader() (it would be iterated character by character), but you can enclose it in a list as above.

You'll have to be careful with the format of your data or tell csv how to handle it. By default the quotes have to come immediately after the comma or the csv module will interpret the field as beginning with a space rather than being quoted. You can fix this using the skipinitialspace option.
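To make the effect of skipinitialspace concrete, here is a minimal sketch that parses the string from the question both ways. The variable names default_fields and trimmed_fields are only illustrative.

```python
import csv

teststring = '48, "one, two", "2011/11/03"'

# Default dialect: the space after each comma becomes part of the field,
# so the quote that follows it is not treated as an opening quote.
default_fields = next(csv.reader([teststring]))

# skipinitialspace=True drops whitespace right after a delimiter,
# so the quote is seen at the start of the field and is honoured.
trimmed_fields = next(csv.reader([teststring], skipinitialspace=True))

print(default_fields)
print(trimmed_fields)
```

With the default settings the quoted field is split at its inner comma, which is the same problem the question started with.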

Upvotes: 9

Raymond Hettinger

Reputation: 226694

The csv module will work if you set options to handle this dialect:

>>> import csv
>>> teststring = '48, "one, two", "2011/11/03"'
>>> for line in csv.reader([teststring], skipinitialspace=True):
...     print(line)
...
['48', 'one, two', '2011/11/03']

Upvotes: 31

jcollado

Reputation: 40424

You can use the shlex module to parse your string.

By default, shlex.split will split your string at whitespace characters not enclosed in quotes:

>>> import shlex
>>> teststring = '48, "one, two", "2011/11/03"'
>>> shlex.split(teststring)
['48,', 'one, two,', '2011/11/03']

This doesn't remove the trailing commas from your string, but it's close to what you need. However, if you customize the parser to treat the comma as a whitespace character, then you'll get the output that you need:

>>> parser = shlex.shlex(teststring)
>>> parser.whitespace
' \t\r\n'
>>> parser.whitespace += ','
>>> list(parser)
['48', '"one, two"', '"2011/11/03"']

Note: the parser object is used as an iterator to get the tokens one by one. Hence, list(parser) iterates over the parser object and returns the string split where you need it.
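If you would rather have the surrounding quotes stripped, one possible variation (a sketch, not part of the approach above) is to run the parser in POSIX mode and set whitespace_split so that only whitespace, including the added comma, separates tokens. The helper name split_quoted and its delimiter parameter are made up for this example.

```python
import shlex

def split_quoted(text, delimiter=','):
    """Hypothetical helper: split on the delimiter and whitespace, honouring quotes."""
    # posix=True makes the parser strip the quote characters from tokens
    parser = shlex.shlex(text, posix=True)
    # treat the delimiter as one more whitespace character
    parser.whitespace += delimiter
    # split only on whitespace, so characters such as '/' stay inside a token
    parser.whitespace_split = True
    return list(parser)

print(split_quoted('48, "one, two", "2011/11/03"'))
```

Keep in mind that in POSIX mode shlex also interprets backslash escapes and single quotes, and that '#' starts a comment by default, so this is only suitable if your data does not rely on those characters.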

Upvotes: 7

Acorn

Reputation: 50567

You should use the Python csv library: http://docs.python.org/library/csv.html

Upvotes: 3
