ewok

Reputation: 21443

Python: pull fields from formatted string

I have a list of strings that are formatted as key/value pairs, separated by spaces. For example, a message may be:

"time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message"

The key/value pairs will always be in that order and the message will always be in this form. I want to convert this string into a dictionary in this form:

{'level': 1,
 'message': 'This is a message',
 'sequenceNum': 35,
 'subject': 'subject goes here',
 'time': '2016/06/14 16:44:00.000',
 'user': 'Username'}

A couple things to note:

  1. I want level and sequenceNum to be numbers, not strings
  2. there can be spaces in the timestamp, the subject, and the message, so I can't just split on spaces
  3. the message and subject may contain anything, so I can't split on the labels or the equal sign either. They will however always be the 2nd to last and last things in the string. If we can solve the issue of the subject potentially containing the string 'message=', which would make it impossible to distinguish where the subject ends and the message starts, that's great, but for now I'm willing to ignore that problem.

Currently the best I have is this:

item = {}
item['time'] = message[5:message.index('level=')].strip()
message = message[message.index('level='):]
item['level'] = int(message[6:message.index('sequenceNum=')].strip())
message = message[message.index('sequenceNum='):]
#etc.

I don't really like this, even though it obviously works fine. I was hoping there was a more elegant way to do it based on string formatting. For example, if I were trying to create this string, I could use this:

"time=%s level=%s sequenceNum=%s user=%s subject=%s message=%s" % (item['time'], item['level'], item['sequenceNum'], item['user'], item['subject'], item['message'])

I'm wondering if it's possible to do it in the other direction.

Upvotes: 2

Views: 109

Answers (5)

PeterE

Reputation: 5855

For this I would go with regular expressions. That might not be the fastest (performance-wise) or the easiest (to understand) solution but it will certainly work. (And is probably the closest you will get to a "reverse-format")

import re

pattern = re.compile(
    # raw strings, so that \s, \d and \w reach the regex engine unchanged
    r"time=(?P<time>.+)\s"
    r"level=(?P<level>\d+)\s"
    r"sequenceNum=(?P<sequenceNum>\d+)\s"
    r"user=(?P<user>\w+)\s"
    r"subject=(?P<subject>.+?)\s"    # <-- EDIT: changed from greedy '.+' to non-greedy '.+?'
    r"message=(?P<message>.+)"
    )

lines = ["time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message",
         "time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message=hello"]

for line in lines:
    match = pattern.match(line)
    item = match.groupdict()
    print(item)

To get the numbers as numbers you can then do something like item['level'] = int(item['level']).
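A minimal sketch of that conversion step wrapped into a function, assuming the same field layout as the pattern above. The names `PATTERN`, `INT_FIELDS` and `parse_line` are illustrative and not part of the original answer:

```python
import re

# Same field layout as the pattern above, written as raw strings.
PATTERN = re.compile(
    r"time=(?P<time>.+)\s"
    r"level=(?P<level>\d+)\s"
    r"sequenceNum=(?P<sequenceNum>\d+)\s"
    r"user=(?P<user>\w+)\s"
    r"subject=(?P<subject>.+?)\s"
    r"message=(?P<message>.+)"
)

# Hypothetical name: the fields the question wants as integers.
INT_FIELDS = ("level", "sequenceNum")


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    item = match.groupdict()
    for field in INT_FIELDS:
        # \d+ in the pattern guarantees these are digit strings
        item[field] = int(item[field])
    return item
```

Returning None for a non-matching line is just one choice; raising an exception would work equally well if malformed lines should never be skipped silently.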

If you are interested I can expand a bit on how I constructed the regular expression and how it could be improved.

EDIT: Changed expression to cover edge-case of message= being in subject.
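To see what the greedy versus non-greedy choice actually changes, here is a small sketch on a shortened, made-up line that contains " message=" twice. It only uses the subject and message parts of the pattern:

```python
import re

# Made-up input with two candidate split points.
line = "subject=re: message=foo message=real body"

# Non-greedy subject: splits at the FIRST " message=".
lazy = re.match(r"subject=(?P<subject>.+?)\smessage=(?P<message>.+)", line)

# Greedy subject: splits at the LAST " message=".
greedy = re.match(r"subject=(?P<subject>.+)\smessage=(?P<message>.+)", line)

print(lazy.groupdict())
print(greedy.groupdict())
```

With the non-greedy version the message may contain "message=", but a subject containing " message=" is still cut short. The ambiguity itself cannot be resolved by the pattern alone.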

Upvotes: 2

Håken Lid

Reputation: 23064

You can use a regular expression and re.findall() to find each key-value-pair. The advantage of this method is that it should work with any string of key=value pairs.

import re
data = ("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 "
        "user=Username subject=subject goes here message=This is a message")
matches =  re.findall(r'(\w+)=([^=]+)(?:\s|\Z)', data)
{key: int(val) if val.isdigit() else val for key, val in matches}

In the output, all values that look like integers have been converted to int.

{'level': 1,
'message': 'This is a message',
'sequenceNum': 35,
'subject': 'subject goes here',
'time': '2016/06/14 16:44:00.000',
'user': 'Username'}

If you didn't need to convert the numbers to integers, it would have been even simpler:

dict(re.findall(r'(\w+)=([^=]+)(?:\s|\Z)', data))

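How the regular expression (\w+)=([^=]+)(?:\s|\Z) works: (\w+) captures the key, ([^=]+) captures a value that contains no '=', and (?:\s|\Z) requires the value to be followed by whitespace or the end of the string. Because the value cannot run past the next '=', the regex backtracks to the last space before the next key, so the key itself is not swallowed.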

If your input data contains something like this: "subject=message=subject=message=subject", you have a problem, because it's ambiguous. You would have to sanitize the input, or just raise an exception.

if data.count('=') != 6:
    raise ValueError('malformed input data: {}'.format(data))
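A slightly stricter variant of that check compares the keys that were found against the expected ones, instead of only counting '=' signs. This is a sketch: `EXPECTED_KEYS` and `parse_strict` are hypothetical names, and the built-in `ValueError` stands in for whatever exception your application prefers.

```python
import re

# Assumed field order, taken from the example in the question.
EXPECTED_KEYS = ["time", "level", "sequenceNum", "user", "subject", "message"]


def parse_strict(data):
    matches = re.findall(r'(\w+)=([^=]+)(?:\s|\Z)', data)
    keys = [key for key, _ in matches]
    # Reject anything whose keys are missing, extra, or out of order.
    if keys != EXPECTED_KEYS:
        raise ValueError('malformed input data: {}'.format(data))
    return {key: int(val) if val.isdigit() else val for key, val in matches}
```

An input such as "subject=a=b" yields a stray key 'a' in place of 'subject', so the key comparison catches it.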

Upvotes: 2

Oluwafemi Sule

Reputation: 38922

I came up with something that allows you to use custom converters.

class Convert(object):
    def __init__(self, *args):
        self.content = ' '.join(args)

    def sequenceNum(self):
        return int(self.content)

    def level(self):
        return int(self.content)

    def __getattr__(self, name):
        def wrapper(*args, **kwargs):
            return self.content
        return wrapper


def line_to_dict(s):
    r = {}
    s_split = s.split('=')
    l = len(s_split) - 1
    k = None

    for i, content in enumerate(s_split):
        if not i:
            k = content
            continue

        content_split = content.split()

        # Every chunk except the last one ends with the next key.
        next_k = content_split.pop() if i != l else None

        r[k] = getattr(Convert(*content_split), k)()

        k = next_k

    return r


if __name__ == "__main__":
    print(line_to_dict('time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username message=This is a message'))

Upvotes: 0

Prune

Reputation: 77827

A little lateral thinking: split on the '=', which leaves each label at the end of the previous row. For instance:

in_stuff = ("time=2016/06/14 16:44:00.000 level=1 sequenceNum=35"
            " user=Username message=This is a message")
skew = in_stuff.split('=')
table = [entry.split() for entry in skew]
out_dict = {table[i][-1]: ' '.join(table[i+1][:-1])
            for i in range(len(table)-1)}
print(out_dict)

This isn't quite done, but illustrates the idea. Output:

{'sequenceNum': '35',
 'level': '1',
 'message': 'This is a',
 'user': 'Username',
 'time': '2016/06/14 16:44:00.000'}

You still need to convert the numbers, and recover the message's last word from the last row of table. I could do these in-line, but thought they'd clog the presentation a little.
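One possible way to finish those two steps, as a sketch rather than the only way to do it. The function name `split_on_equals` is made up, and the code assumes the input contains at least one '=':

```python
def split_on_equals(in_stuff):
    skew = in_stuff.split('=')
    table = [entry.split() for entry in skew]
    out_dict = {table[i][-1]: ' '.join(table[i + 1][:-1])
                for i in range(len(table) - 1)}

    # The last row has no trailing label, so its final word belongs to
    # the value of the last key and has to be put back.
    last_key = table[-2][-1]
    out_dict[last_key] = ' '.join(table[-1])

    # Convert values that consist only of digits to int.
    return {key: int(val) if val.isdigit() else val
            for key, val in out_dict.items()}
```

Note that joining with a single space collapses any runs of whitespace inside a value, which may or may not matter for your messages.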

Upvotes: 0

Hans

Reputation: 2492

OK, I quickly threw something together:

inputString = "time=2016/06/14 16:44:00.000 level=1 sequenceNum=35 user=Username subject=subject goes here message=This is a message"

keys = []
#key is always the element before the '=' sign
for segment in inputString.split('='):
    keys.append(segment.split(" ")[-1])

values = inputString.split("=")
for i in range(len(values)):
    #split the values by the spaces
    tmp = values[i].split(" ")
    #and remove the last part --> the part before the equals sign
    tmp.pop(-1)
    #join them back together
    values[i] = ' '.join(tmp)
#the first element is now empty, because there is no value before the first '='
values.pop(0)

#the last word of the message is missing at this point, because it was interpreted as yet another key
if ' ' in inputString.split("=")[-1]:
    values[-1] += ' '+inputString.split("=")[-1].split(' ')[-1]
else:
    #if the last element does not contain a space it will be missing entirely --> adding it back in
    values.pop(-1)
    #append, not +=, so the string is added as one element instead of character by character
    values.append(inputString.split("=")[-1])

# combining it to a dict
outputDict = dict(zip(keys, values))
print(outputDict)

Upvotes: 0
