Reputation: 555
I would like to split a string such as:
CS ID=123 HD=CT NE="HI THERE"
into a list that looks like:
['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']
The shlex.split() function does this, but it is terribly slow. I need a very fast way to do this, probably using Python regular expressions. Any help is appreciated. Thanks!
Edit: I did not anticipate that the re module, when used in Jython, would be nearly as slow as the shlex module. Does anyone know how to do this without the re module, using Java regular expressions instead, or in some other clever way?
Upvotes: 2
Views: 1460
Reputation: 123881
Not sure whether you really want to strip the quotes from the HI THERE part; this version keeps the double quotes:
>>> import re
>>> x = '''CS ID=123 HD=CT NE="HI THERE"'''
>>> re.findall("""\w+="[^"]*"|\w+='[^']*'|\w+=\w+|\w+""", x)
['CS', 'ID=123', 'HD=CT', 'NE="HI THERE"']
Without the quotes on the HI THERE part:
>>> map(''.join,re.findall("""(\w+=)"([^"]*)"|(\w+=)'([^']*)'|(\w+=\w+)|(\w+)""", x))
['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']
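If the same pattern is applied to many lines, it may be worth compiling it once and reusing it. The following is only a sketch of that idea, using the same pattern as above written as a raw string; the names TOKEN_RE and split_fields are made up for illustration, and whether this is actually faster under Jython has not been measured here. The list comprehension replaces map() so the result is a list on both Python 2 and Python 3.

```python
import re

# Hypothetical names (TOKEN_RE, split_fields), chosen for illustration only.
# Same alternation as above: double-quoted value, single-quoted value,
# plain key=value, or a bare word.
TOKEN_RE = re.compile(r"""(\w+=)"([^"]*)"|(\w+=)'([^']*)'|(\w+=\w+)|(\w+)""")

def split_fields(line):
    # findall() returns one tuple of groups per match; groups that did not
    # take part in the match are empty strings, so joining each tuple gives
    # the token with its surrounding quotes dropped.
    return [''.join(groups) for groups in TOKEN_RE.findall(line)]

print(split_fields('CS ID=123 HD=CT NE="HI THERE"'))
# ['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']
```

Note that \w+ only matches letters, digits and underscores, so values containing other characters (for example a hyphen) would be split differently; that limitation is inherited from the pattern above.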
Upvotes: 3
Reputation: 32439
Not using regex, but iterating over the input (no idea if faster or slower):
def tokenize(s):
    quoted = False
    buff = ''
    for c in s:
        if c == ' ' and not quoted:
            yield buff
            buff = ''
            continue
        if c == '"':
            quoted = not quoted
            continue
        buff += c
    yield buff

a = 'CS ID=123 HD=CT NE="HI THERE"'
for t in tokenize(a):
    print(t)
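A possible variation on the same loop, sketched here with no claim about its speed: collect the characters in a list and join them once per token instead of growing a string with +=, and skip the empty tokens that repeated spaces would otherwise produce. The name tokenize_list is hypothetical, and dropping empty tokens is an assumption about the desired behaviour, not something the question specifies.

```python
def tokenize_list(s):
    # Hypothetical variant of tokenize(); returns a list instead of yielding.
    tokens = []
    buff = []        # characters of the token currently being built
    quoted = False   # True while inside a double-quoted run
    for c in s:
        if c == '"':
            quoted = not quoted          # toggle quoting, drop the quote itself
        elif c == ' ' and not quoted:
            if buff:                     # assumption: ignore empty tokens
                tokens.append(''.join(buff))
                buff = []
        else:
            buff.append(c)
    if buff:                             # flush the last token, if any
        tokens.append(''.join(buff))
    return tokens

print(tokenize_list('CS ID=123 HD=CT NE="HI THERE"'))
# ['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']
```

Unlike tokenize(), this returns an empty list for an empty string and never emits '' as a token, so a standalone "" in the input is dropped.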
Upvotes: 0