I have data in text format where key/value pairs are separated by a semicolon, which may or may not be surrounded by whitespace, e.g., ";", "; ", or even " ; ". There is always a semicolon between pairs, and the string is terminated with a semicolon.
Keys and values are separated by whitespace.
This string is flat. There's never anything nested. Strings are always quoted and numerical values are never quoted. I can count on this being consistent in the input. So for example,
'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
Ultimately this winds up as
{'cheese': "stilton", 'pigeons': 17, 'color': "blue", 'why': "because I said so"}
Different strings may include different key/value pairs, and I can't know in advance which keys will be present. So this is an equally valid input string:
mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";
I'm thinking that a regex to split the string into a list would be a good start, then just iterate through the list by twos to build the dictionary. Something like
x = PATTERN.split(s)
d = {}
for i in range(0, len(x), 2):
    d[x[i]] = x[i+1]
This requires a list like ['cheese', 'stilton', 'pigeons', 17, 'color', 'blue', 'why', 'because I said so']. But I can't figure out a regex that gets the string into this form. The closest I have is
([^;[\s]*]+)
Which returns
['', 'cheese', ' ', '"stilton"', ';', 'pigeons', ' ', '17', '; ', 'color', ' ', '"blue"', '; ', 'why', ' ', '"because', ' ', 'I', ' ', 'said', ' ', 'so"', ';']
Of course, it's easy enough to iterate by threes and pick the key/value pairs and ignore the captured delimiters, but I'm wondering if there's a different regex that would not capture the delimiters. Any suggestions?
Upvotes: 1
Views: 1769
Reputation: 92440
It might be easier to use findall() instead of split() here. This lets you use capture groups to pull out just the parts you want. Then you can unpack the groups, clean up, etc.:
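import re
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
# Each match is a (key, value) tuple: key, one space, value up to the next ';'
pairs = re.findall(r'(\S+?) (.+?);', s)
d = {}
for k, v in pairs:
    if v.isdigit():
        # Unquoted, all-digit values become integers
        v = int(v)
    else:
        # Everything else is treated as a quoted string
        v = v.strip('"')
    d[k] = v
print(d)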
result
{'cheese': 'stilton',
'pigeons': 17,
'color': 'blue',
'why': 'because I said so'}
This, of course, assumes you aren't using ; anywhere in the data.
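To illustrate that caveat, here is a minimal sketch using the same pattern on a made-up input (the `note` key and its value are hypothetical, not from the question) where a semicolon sits inside a quoted value. The lazy `.+?` stops at the first `;`, so the value is cut short:

```python
import re

# Same pattern as in the answer above: key, one literal space, value up to the next ';'
pattern = r'(\S+?) (.+?);'

# Hypothetical input: the quoted value itself contains a semicolon
s = 'note "a; b";'

pairs = re.findall(pattern, s)
print(pairs)  # [('note', '"a')]  (the value is truncated at the inner ';')
```

If semicolons can appear inside quoted strings, the pattern would need to treat a quoted value as a single unit instead of stopping at the first `;`.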
Upvotes: 1
Reputation: 626758
You may use r'(\w+)\s+("[^"]*"|[^\s;]+)' to match and extract your data with re.findall, then post-process the Group 2 values to remove one leading and one trailing " char if the first alternative matched, and create a dictionary entry.
See the regex demo.
Details
(\w+) - Group 1 (key): one or more word chars
\s+ - one or more whitespace chars
("[^"]*"|[^\s;]+) - Group 2: either a ", zero or more chars other than ", and then a closing ", or one or more chars other than whitespace and ;
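As a quick check of what the two groups capture, the sketch below runs the pattern on the second sample string from the question, which mixes " ; ", ";" and "; " as separators. The variable names are only illustrative:

```python
import re

rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'

# Second sample string from the question, with irregular spacing around ';'
s = 'mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";'

# findall returns one (Group 1, Group 2) tuple per key/value pair;
# the separators are never captured, and quoted values still carry their quotes
pairs = re.findall(rx, s)
for key, val in pairs:
    print(key, '->', val)
# mass -> 6.02
# mammal -> "gerbil"
# telephone -> "+1 903 555-1212"
# size -> "A1"
```

Because the first alternative in Group 2 consumes a whole quoted string, spaces inside quotes (as in the telephone number) do not split the value.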
import re
rx = r'(\w+)\s+("[^"]*"|[^\s;]+)'
s = 'cheese "stilton";pigeons 17; color "blue"; why "because I said so";'
result = {}
for key, val in re.findall(rx, s):
    # Strip the surrounding quotes from quoted string values
    if val.startswith('"') and val.endswith('"'):
        val = val[1:-1]
    result[key] = val
print(result)
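Note that the loop above leaves unquoted values as strings, so `pigeons` comes out as `'17'` and not `17`. The following is one possible sketch of the extra conversion step, assuming (as the question states) that strings are always quoted and numbers never are. The helper names `convert_value` and `parse_pairs` are hypothetical, introduced only for this example:

```python
import re

RX = re.compile(r'(\w+)\s+("[^"]*"|[^\s;]+)')

def convert_value(raw):
    """Hypothetical helper: quoted -> str, otherwise int, otherwise float."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]
    try:
        return int(raw)
    except ValueError:
        # Assumes any remaining unquoted value is a valid float, per the
        # question's guarantee; malformed input would raise ValueError here
        return float(raw)

def parse_pairs(s):
    """Hypothetical helper: build a dict from a 'key value;' string."""
    return {key: convert_value(val) for key, val in RX.findall(s)}

d1 = parse_pairs('cheese "stilton";pigeons 17; color "blue"; why "because I said so";')
d2 = parse_pairs('mass 6.02 ; mammal "gerbil";telephone "+1 903 555-1212"; size "A1";')
print(d1)
print(d2)
```

Checking for quotes before attempting a numeric conversion keeps a quoted value such as "17" as a string, which matches the question's rule that only unquoted values are numbers.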