PartialOrder

Reputation: 2960

Regex to parse delimited string with key/value pairs (python)

I have data in text format where key/value pairs are separated by a semicolon that may or may not be surrounded by whitespace, e.g., ";", "; ", or even " ; ". There will always be a semicolon between pairs, and the string is terminated with a semicolon.

Keys and values are separated by whitespace.

This string is flat. There's never anything nested. Strings are always quoted and numerical values are never quoted. I can count on this being consistent in the input. So for example,

'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'

Ultimately this winds up as

{'cheese': "stilton", 'pigeons': 17, 'color': "blue", 'why': "because I said so"}

Different strings may include different key/value pairs, and I can't know in advance which keys will be present. So this is an equally valid input string:

mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";

I'm thinking that a regex to split the string into a list would be a good start, then just iterate through the list by twos to build the dictionary. Something like

x = PATTERN.split(s)
d = {}
for i in range(0, len(x), 2):
    d[x[i]] = x[i+1]

That requires a list like ['cheese', 'stilton', 'pigeons', 17, 'color', 'blue', 'why', 'because I said so']. But I can't figure out a regex that produces this form. The closest I have is

([^;[\s]*]+)

Which returns

['', 'cheese', ' ', '"stilton"', ';', 'pigeons', ' ', '17', '; ', 'color', ' ', '"blue"', '; ', 'why', ' ', '"because', ' ', 'I', ' ', 'said', ' ', 'so"', ';']

Of course, it's easy enough to iterate by threes and pick the key/value pairs and ignore the captured delimiters, but I'm wondering if there's a different regex that would not capture the delimiters. Any suggestions?

Upvotes: 1

Views: 1769

Answers (2)

Mark

Reputation: 92440

It might be easier to use findall() instead of split() here. This lets you use capture groups to pull out just the parts you want. Then you can iterate over the groups, clean up the values, etc.:

import re
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
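pairs = re.findall(r'(\S+?) (.+?);', s)  # list of (key, raw value) tuples

d = {}
for k, v in pairs:
    if v.isdigit():
        v = int(v)          # unquoted non-negative integer
    else:
        v = v.strip('"')    # quoted string: drop the surrounding quotes
    d[k] = v
print(d)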

Result:

{'cheese': 'stilton',
 'pigeons': 17,
 'color': 'blue',
 'why': 'because I said so'}

This, of course, assumes that ; never appears inside the data itself (for example, within a quoted value).
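To make that caveat concrete, here is a small sketch of what happens when a quoted value does contain a semicolon. The input string is a made-up example, not from the question, and the name PAIR is only for illustration.

```python
import re

# Same pattern as in the answer: lazy key, one space, lazy value up to the next ';'.
PAIR = re.compile(r'(\S+?) (.+?);')

# Hypothetical input: the first value has a ';' inside the quotes.
s = 'note "a;b"; pigeons 17;'

pairs = PAIR.findall(s)
print(pairs)
```

The first value is cut off at the embedded semicolon, and the leftover text is then misread as the next key, so the second pair comes out wrong as well.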

Upvotes: 1

Wiktor Stribiżew

Reputation: 626758

You may use

r'(\w+)\s+("[^"]*"|[^\s;]+)'

to match and extract your data with re.findall, then post-process the Group 2 values to remove one leading and one trailing " character if the first alternative matched, and finally create the dictionary entries.


Details

  • (\w+) - Group 1 (key): one or more word chars
  • \s+ - 1+ whitespace chars
  • ("[^"]*"|[^\s;]+) - Group 2 (value): either a ", zero or more chars other than ", and a closing "; or one or more chars other than whitespace and ;
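
As a quick check of the pieces listed above, this sketch prints the raw tuples that re.findall returns before any post-processing. The input is a made-up string that mixes the delimiter styles from the question and puts a ; inside a quoted value.

```python
import re

rx = re.compile(r'(\w+)\s+("[^"]*"|[^\s;]+)')

# Hypothetical input: " ; " and ";" as delimiters, several spaces between
# a key and its value, and a ';' inside the first quoted value.
s = 'note "a;b" ; pigeons 17;color   "blue";'

raw = rx.findall(s)
print(raw)
```

Because the quoted alternative is tried first and consumes everything up to the closing quote, the semicolon inside "a;b" is not treated as a delimiter. Quoted values still carry their quotes at this stage.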

Python demo:

import re
rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
result = {}
for key, val in re.findall(rx, s):
    # Strip the surrounding quotes from quoted values; numbers stay as strings
    if val.startswith('"') and val.endswith('"'):
        val = val[1:-1]
    result[key] = val

print(result)
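
Note that the demo leaves numeric values as strings ('17' rather than 17). If you want the numbers converted as in the question's expected output, one possible extension looks like this. It is only a sketch: convert and parse are hypothetical helper names, and it assumes every unquoted value is something int() or float() accepts.

```python
import re

rx = re.compile(r'(\w+)\s+("[^"]*"|[^\s;]+)')

def convert(token):
    """Turn one raw value into str, int, or float (hypothetical helper)."""
    # Quoted values are always strings, even if they look numeric.
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return token[1:-1]
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token  # anything unrecognized is left as a string

def parse(s):
    """Build the dictionary from one delimited string (hypothetical helper)."""
    return {key: convert(val) for key, val in rx.findall(s)}

d1 = parse('cheese "stilton";pigeons 17; color "blue"; why "because I said so";')
d2 = parse('mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";')
print(d1)
print(d2)
```

Trying int() before float() keeps 17 an integer while still handling 6.02 from the second sample string.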

Upvotes: 1
