Reputation: 21443
I have a list of strings that are formatted as key/value pairs, separated by spaces. For example, a message may be:
"time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message"
The key/value pairs will always be in that order and the message will always be in this form. I want to convert this string into a dictionary in this form:
{'level': 1,
'message': 'This is a message',
'sequenceNum': 35,
'subject': 'subject goes here',
'time': '2016/06/14 16:44:00.000',
'user': 'Username'}
A couple things to note:
- I want level and sequenceNum to be numbers, not strings.
- If there is a way to handle a subject that contains 'message=', which would make it impossible to distinguish where the subject ends and the message starts, that's great, but for now I'm willing to ignore that problem.
Currently the best I have is this:
item = {}
item['time'] = message[5:message.index('level=')].strip()
message = message[message.index('level='):]
item['level'] = int(message[6:message.index('sequenceNum=')].strip())
message = message[message.index('sequenceNum='):]
#etc.
I don't really like this, even though it obviously works fine. I was hoping there was a more elegant way to do it based on string formatting. For example, if I were trying to create this string, I could use this:
"time=%s level=%s sequenceNum=%s user=%s subject=%s message=%s" % (item['time'], item['level'], item['sequenceNum'], item['user'], item['subject'], item['message'])
I'm wondering if it's possible to do it in the other direction.
Upvotes: 2
Views: 109
Reputation: 5855
For this I would go with regular expressions. That might not be the fastest (performance-wise) or the easiest (to understand) solution but it will certainly work. (And is probably the closest you will get to a "reverse-format")
import re

# Raw strings keep the regex escapes (\s, \d, \w) intact.
pattern = re.compile(
    r"time=(?P<time>.+)\s"
    r"level=(?P<level>\d+)\s"
    r"sequenceNum=(?P<sequenceNum>\d+)\s"
    r"user=(?P<user>\w+)\s"
    r"subject=(?P<subject>.+?)\s"  # <-- EDIT: changed from greedy '.+' to non-greedy '.+?'
    r"message=(?P<message>.+)"
)

lines = ["time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message",
         "time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message=hello"]

for line in lines:
    match = pattern.match(line)
    # groupdict() maps each named group to the text it captured
    item = match.groupdict()
    print(item)
To get the numbers as numbers you can then do something like item['level'] = int(item['level']).
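As a rough sketch of that conversion step, the pattern and the int() calls could be wrapped in one function. The names parse_line and INT_FIELDS are made up for this example, and the check for a failed match is an addition that the code above does not have.

```python
import re

# Same named-group pattern as in the answer, written with raw strings.
pattern = re.compile(
    r"time=(?P<time>.+)\s"
    r"level=(?P<level>\d+)\s"
    r"sequenceNum=(?P<sequenceNum>\d+)\s"
    r"user=(?P<user>\w+)\s"
    r"subject=(?P<subject>.+?)\s"
    r"message=(?P<message>.+)"
)

# Hypothetical constant: the fields the question wants as numbers.
INT_FIELDS = ("level", "sequenceNum")


def parse_line(line):
    """Parse one log line into a dict, converting the numeric fields."""
    match = pattern.match(line)
    if match is None:
        # Assumption: a line that does not fit the format should fail loudly.
        raise ValueError("line does not match the expected format: %r" % line)
    item = match.groupdict()
    for field in INT_FIELDS:
        item[field] = int(item[field])
    return item


item = parse_line("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 "
                  "user=Username subject=subject goes here "
                  "message=This is a message")
print(item)
```

Since \d+ only matches digits, the int() calls should not fail on a line that matched.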
If you are interested I can expand a bit on how I constructed the regular expression and how it could be improved.
EDIT: Changed expression to cover edge-case of message= being in subject.
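To illustrate what the greedy to non-greedy change does, here is a small comparison on the second sample line. It uses only the subject/message tail of the pattern, and the names greedy and lazy are just for this sketch.

```python
import re

line = ("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username "
        "subject=subject goes here message=This is a message=hello")

# Only the tail of the full pattern, to isolate the difference.
greedy = re.compile(r"subject=(?P<subject>.+)\smessage=(?P<message>.+)")
lazy = re.compile(r"subject=(?P<subject>.+?)\smessage=(?P<message>.+)")

# Greedy '.+' extends the subject up to the LAST ' message='.
greedy_item = greedy.search(line).groupdict()
# Non-greedy '.+?' stops the subject at the FIRST ' message='.
lazy_item = lazy.search(line).groupdict()

print(greedy_item)
print(lazy_item)
```

The flip side is that the non-greedy version cuts the subject short if the subject itself contains " message=", so the ambiguity mentioned in the question does not go away. It is just resolved in favour of the first occurrence.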
Upvotes: 2
Reputation: 23064
You can use a regular expression and re.findall() to find each key-value-pair. The advantage of this method is that it should work with any string of key=value pairs.
import re
data = ("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 "
"user=Username subject=subject goes here message=This is a message")
matches = re.findall(r'(\w+)=([^=]+)(?:\s|\Z)', data)
{key: int(val) if val.isdigit() else val for key, val in matches}
In the output, all values that look like integers have been converted to int.
{'level': 1,
'message': 'This is a message',
'sequenceNum': 35,
'subject': 'subject goes here',
'time': '2016/06/14 16:44:00.000',
'user': 'Username'}
If you didn't need to convert the numbers to integers, it would have been even simpler:
dict(re.findall(r'(\w+)=([^=]+)(?:\s|\Z)', data))
Here's a regex101 explanation of the regular expression (\w+)=([^=]+)(?:\s|\Z)
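In case the explanation is not at hand, the same expression can be written with re.VERBOSE and annotated piece by piece. This is only a sketch on a shortened input, and pair_pattern is a name chosen for the example.

```python
import re

pair_pattern = re.compile(r"""
    (\w+)        # group 1: the key, one or more word characters
    =            # the literal separator
    ([^=]+)      # group 2: the value, any run of characters except '='
    (?:\s|\Z)    # the value must be followed by whitespace or end of string
""", re.VERBOSE)

data = "level=1 user=Username message=This is a message"
pairs = pair_pattern.findall(data)
print(pairs)
```

The value group is greedy, so at first it also swallows the next key (for example "1 user"). The character after that is "=", which is neither whitespace nor the end of the string, so the engine backtracks until the value is followed by a space, which drops the trailing key again.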
If your input data contains something like this: "subject=message=subject=message=subject", you have a problem, because it's ambiguous. You would have to sanitize the input, or just raise an exception.
# ValueError is built in; use your own exception class if you have one
if data.count('=') != 6:
    raise ValueError('malformed input data: {}'.format(data))
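Putting the pieces together, a minimal sketch of a parser with that check could look like this. The names parse_pairs, PAIR_RE and EXPECTED_KEYS are hypothetical, and the extra comparison of the key order is an assumption based on the question saying the keys always come in the same order.

```python
import re

PAIR_RE = re.compile(r'(\w+)=([^=]+)(?:\s|\Z)')

# Assumption: the fixed key order described in the question.
EXPECTED_KEYS = ['time', 'level', 'sequenceNum', 'user', 'subject', 'message']


def parse_pairs(data):
    """Return a dict of the key=value pairs, or raise on ambiguous input."""
    # One '=' per key; any extra '=' means a value contains one.
    if data.count('=') != len(EXPECTED_KEYS):
        raise ValueError('malformed input data: {}'.format(data))
    matches = PAIR_RE.findall(data)
    if [key for key, _ in matches] != EXPECTED_KEYS:
        raise ValueError('unexpected keys in input data: {}'.format(data))
    return {key: int(val) if val.isdigit() else val for key, val in matches}


good = ("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 "
        "user=Username subject=subject goes here message=This is a message")
result = parse_pairs(good)
print(result)

try:
    parse_pairs("subject=message=subject=message=subject")
except ValueError as exc:
    print(exc)
```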
Upvotes: 2
Reputation: 38922
I came up with something that allows you to use custom converters.
class Convert(object):
    def __init__(self, *args):
        self.content = ' '.join(args)

    # One method per key that needs a custom conversion
    def sequenceNum(self):
        return int(self.content)

    def level(self):
        return int(self.content)

    # Fallback for every other key: return the content unchanged
    def __getattr__(self, name):
        def wrapper(*args, **kwargs):
            return self.content
        return wrapper


def line_to_dict(s):
    r = {}
    s_split = s.split('=')
    l = len(s_split) - 1
    k = None
    for i, content in enumerate(s_split):
        if not i:
            # The first chunk is just the first key
            k = content
            continue
        content_split = content.split()
        if i == l:
            # The last chunk is all value
            r[k] = getattr(Convert(*content_split), k)()
        else:
            # The last word of the chunk is the next key
            next_k = content_split.pop()
            r[k] = getattr(Convert(*content_split), k)()
            k = next_k
    return r


if __name__ == "__main__":
    print(line_to_dict('time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username message=This is a message'))
Upvotes: 0
Reputation: 77827
A little lateral thinking: split on the '=', which leaves each label at the end of the previous row. For instance:
in_stuff = ("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35"
            " user=Username message=This is a message")
# Each chunk after the first is "<value> <next key>"
skew = in_stuff.split('=')
table = [entry.split() for entry in skew]
# Key: last word of row i; value: row i+1 without its last word
out_dict = {table[i][-1]: ' '.join(table[i+1][:-1])
            for i in range(len(table)-1)}
print(out_dict)
This isn't quite done, but illustrates the idea. Output:
{'sequenceNum': '35',
'level': '1',
'message': 'This is a',
'user': 'Username',
'time': '2016/06/14 16:44:00.000'}
You still need to convert the numbers, and recover the message's last word from the last row of table. I could do these in-line, but thought they'd clog the presentation a little.
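For completeness, here is one way those two remaining steps might be filled in. It is only a sketch: the isdigit() test for deciding what counts as a number is an assumption, and because split() is called without arguments, runs of whitespace inside a value are collapsed to single spaces.

```python
in_stuff = ("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35"
            " user=Username message=This is a message")

skew = in_stuff.split('=')
table = [entry.split() for entry in skew]

out_dict = {}
last = len(table) - 2  # index of the row holding the final key
for i in range(len(table) - 1):
    key = table[i][-1]
    words = table[i + 1]
    # Every value row except the final one ends with the next key,
    # so the trailing word is dropped everywhere but there.
    value = ' '.join(words if i == last else words[:-1])
    # Assumption: anything made up only of digits should become an int.
    out_dict[key] = int(value) if value.isdigit() else value

print(out_dict)
```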
Upvotes: 0
Reputation: 2492
ok, quickly threw something together
inputString = "time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message"
keys = []
#key is always the element before the '=' sign
for segment in inputString.split('='):
    keys.append(segment.split(" ")[-1])
values = inputString.split("=")
for i in range(len(values)):
    #split the values by the spaces
    tmp = values[i].split(" ")
    #and remove the last part --> the part before the equals sign
    tmp.pop(-1)
    #join them back together
    values[i] = ' '.join(tmp)
#the first element is now empty, because there is no value before the first '='
values.pop(0)
#the last word of the last value is missing in this case, because it was interpreted as yet another key
if ' ' in inputString.split("=")[-1]:
    values[-1] += ' '+inputString.split("=")[-1].split(' ')[-1]
else:
    #if the last element does not contain a space it will be missing entirely --> adding it back in
    values.pop(-1)
    #append the whole string; += would add it one character at a time
    values.append(inputString.split("=")[-1])
# combining it to a dict (zip drops the surplus last entry in keys)
outputDict = dict(zip(keys, values))
print(outputDict)
Upvotes: 0