Regex to parse delimited string with key/value pairs (python)

Question

I have data in text format, where key/value pairs are separated by a semicolon that may or may not be surrounded by whitespace, e.g., ";", "; ", or even " ; ". There is always a semicolon between pairs, and the string is terminated with a semicolon.

Keys and values are separated by whitespace.

This string is flat. There's never anything nested. Strings are always quoted and numerical values are never quoted. I can count on this being consistent in the input. So for example,

'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'

Ultimately this winds up as

{'cheese': "stilton", 'pigeons': 17, 'color': "blue", 'why': "because I said so"}

Different strings may include different key/value pairs, and I can't know in advance which keys will be present. So this is an equally valid input string:

mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";

I'm thinking that a regex to split the string into a list would be a good start, then just iterate through the list by twos to build the dictionary. Something like

x = PATTERN.split(s)
d = {}
for i in range(0, len(x), 2):
    d[x[i]] = x[i+1]

This requires a list like ['cheese', 'stilton', 'pigeons', 17, 'color', 'blue', 'why', 'because I said so'], but I can't figure out a regex that gets the string into this form. The closest I have is

([^;[\s]*]+)

Which returns

['', 'cheese', ' ', '"stilton"', ';', 'pigeons', ' ', '17', '; ', 'color', ' ', '"blue"', '; ', 'why', ' ', '"because', ' ', 'I', ' ', 'said', ' ', 'so"', ';']

Of course, it's easy enough to iterate by threes and pick the key/value pairs and ignore the captured delimiters, but I'm wondering if there's a different regex that would not capture the delimiters. Any suggestions?

Mark · Accepted Answer

It might be easier to use findall() instead of split() here. This lets you use capture groups to pull out just the parts you want. Then you can unpack the groups, clean up the values, etc.:

import re
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
pairs = re.findall(r'(\S+?) (.+?);', s)

d = {}
for k, v in pairs:
    if v.isdigit():
        v = int(v)
    else:
        v = v.strip('"')
    d[k] = v
print(d)

Result:

{'cheese': 'stilton',
 'pigeons': 17,
 'color': 'blue',
 'why': 'because I said so'}
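
To make the intermediate step concrete, here is a small sketch that prints what findall() returns for the example string before any cleanup. With two capture groups in the pattern, each match comes back as a (key, value) tuple of strings, and the semicolons and the whitespace after them are consumed but not captured:

```python
import re

s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'

# Two capture groups -> findall() returns a list of 2-tuples of strings.
pairs = re.findall(r'(\S+?) (.+?);', s)
print(pairs)
# [('cheese', '"stilton"'), ('pigeons', '17'), ('color', '"blue"'), ('why', '"because I said so"')]
```

The values are still raw text at this point (quotes included, numbers as strings), which is why the loop converts them afterwards.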

This, of course, assumes that ; never appears anywhere in the data, including inside the quoted strings.
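
If the input looks like the second example in the question (mass 6.02 ; mammal "gerbil"; ...), the pattern above would capture '6.02 ' with its trailing space, and isdigit() would leave it as a string because it only recognizes runs of digits. The following is a rough sketch of one possible variant, not the only way to do it: it matches quoted and unquoted values separately, tolerates whitespace around the semicolon, and allows a ; inside quotes. The names PAIR and parse_pairs are made up for illustration, and it assumes quoted strings never contain an embedded double quote.

```python
import re

# Hypothetical pattern (an assumption, not from the original answer):
# a key, whitespace, then either a quoted string or a bare token.
PAIR = re.compile(r'''
    \s*(?P<key>[^\s;"]+)       # key: no whitespace, semicolons, or quotes
    \s+                        # key and value are separated by whitespace
    (?:"(?P<text>[^"]*)"       # quoted string value (may contain ';')
      |(?P<num>[^\s;"]+))      # or an unquoted (numeric) value
    \s*;                       # optional whitespace, then the terminator
''', re.VERBOSE)

def parse_pairs(s):
    """Build a dict from 'key value;' pairs; unquoted values become numbers."""
    result = {}
    for m in PAIR.finditer(s):
        if m.group('text') is not None:
            result[m.group('key')] = m.group('text')
        else:
            num = m.group('num')
            try:
                result[m.group('key')] = int(num)
            except ValueError:
                # Raises ValueError if the unquoted value is not numeric at all.
                result[m.group('key')] = float(num)
    return result

print(parse_pairs('mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";'))
# {'mass': 6.02, 'mammal': 'gerbil', 'telephone': '+1 903 555-1212', 'size': 'A1'}
```

Because finditer() simply skips text that does not match, malformed pairs are dropped silently; stricter validation would need an extra check.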
