mkrems

Reputation: 555

Python Regular Expression for splitting key/value pairs separated by an equal sign when a value may or may not have quotes

I would like to split a string such as:

CS ID=123 HD=CT NE="HI THERE"

into a list that looks like:

['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']

The shlex.split() function does this but it is terribly slow. I need to find a very fast way to do this, probably using python regular expressions. Any help is appreciated. Thanks!

I did not anticipate that the re module, when used in Jython, would be nearly as slow as the shlex module. Does anyone know how to do this without the re module, using Java regular expressions or some other clever approach?

Upvotes: 2

Views: 1460

Answers (2)

YOU

Reputation: 123881

I'm not sure you really want to strip the quotes from the HI THERE part, so this version keeps the double quotes:

>>> import re
>>> x = '''CS ID=123 HD=CT NE="HI THERE"'''
>>> re.findall(r"""\w+="[^"]*"|\w+='[^']*'|\w+=\w+|\w+""", x)
['CS', 'ID=123', 'HD=CT', 'NE="HI THERE"']

And this version drops the quotes around the HI THERE part:

>>> [''.join(groups) for groups in re.findall(r"""(\w+=)"([^"]*)"|(\w+=)'([^']*)'|(\w+=\w+)|(\w+)""", x)]
['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']
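If the same pattern is applied to many lines, it may help to compile it once and wrap it in a function. The following is only a sketch under that assumption: the names PAIR_RE and split_pairs are made up for illustration, and I have not measured whether it makes a difference in Jython.

```python
import re

# Hypothetical names (PAIR_RE, split_pairs); the pattern is the same
# six-group alternation as above, written as a raw string and compiled once.
PAIR_RE = re.compile(r'''(\w+=)"([^"]*)"|(\w+=)'([^']*)'|(\w+=\w+)|(\w+)''')

def split_pairs(line):
    """Return the tokens of line with the quotes around values removed."""
    # findall returns one tuple of six groups per match; only the groups of
    # the alternative that matched are non-empty, so joining them rebuilds
    # the token without its quotes.
    return [''.join(groups) for groups in PAIR_RE.findall(line)]

print(split_pairs('CS ID=123 HD=CT NE="HI THERE"'))
# ['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']
```

Note that \w+ only matches letters, digits and underscores, so unquoted values containing other characters (for example a dot or a hyphen) would need a wider character class.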

Upvotes: 3

Hyperboreus

Reputation: 32439

This does not use a regex; it iterates over the input character by character (I have not measured whether it is faster or slower):

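def tokenize(s):
    """Yield space-separated tokens, treating double-quoted runs as one token."""
    quoted = False  # True while inside a double-quoted section
    buff = ''
    for c in s:
        if c == ' ' and not quoted:
            # An unquoted space ends the current token.
            yield buff
            buff = ''
            continue
        if c == '"':
            # Toggle quoting and drop the quote character itself.
            quoted = not quoted
            continue
        buff += c
    yield buff  # the last token has no trailing space

a = 'CS ID=123 HD=CT NE="HI THERE"'
for t in tokenize(a):
    print(t)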
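To find out which approach is actually faster on a given interpreter, something like the following could be used. It is a rough sketch, not a proper benchmark: the function names, the sample line and the repeat count are all assumptions chosen for illustration, and the regex variant is borrowed from the other answer. Timings will differ between CPython and Jython, so no numbers are shown.

```python
import re
import shlex
import timeit

LINE = 'CS ID=123 HD=CT NE="HI THERE"'
EXPECTED = ['CS', 'ID=123', 'HD=CT', 'NE=HI THERE']

# Pattern from the regex-based answer, compiled once (hypothetical name).
PAIR_RE = re.compile(r'''(\w+=)"([^"]*)"|(\w+=)'([^']*)'|(\w+=\w+)|(\w+)''')

def with_shlex(s):
    return shlex.split(s)

def with_regex(s):
    return [''.join(groups) for groups in PAIR_RE.findall(s)]

def with_loop(s):
    # Same logic as the character loop above, collected into a list.
    tokens, buff, quoted = [], '', False
    for c in s:
        if c == ' ' and not quoted:
            tokens.append(buff)
            buff = ''
        elif c == '"':
            quoted = not quoted
        else:
            buff += c
    tokens.append(buff)
    return tokens

for fn in (with_shlex, with_regex, with_loop):
    # Check that each variant agrees on the sample before timing it.
    assert fn(LINE) == EXPECTED
    seconds = timeit.timeit(lambda: fn(LINE), number=2000)
    print(fn.__name__, seconds)
```

The three variants only agree on simple input like the sample line; they handle single quotes, escapes and repeated spaces differently, so they should be compared on realistic data before choosing one.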

Upvotes: 0
