KevinA

Reputation: 629

Python: emulate strptime behavior

I have a Python program that takes files from many sources. All files from the same source share one format, but the formats vary greatly between sources. One source might use ServerName - ProcessID - Date; another might use (Date)_Username_ProcessID_Server. Currently, adding a new source with a new format requires a coder to write a parse function for it.

I've started writing a new adapter, and I'd like to store the file format as a string. The first one would be something like %S - %P - %D, and the second something like (%D)_%U_%P_%S.

What would be the best approach for this in Python 3?

Upvotes: 1

Views: 48

Answers (1)

Jamie Cockburn

Reputation: 7555

Something like this would be reasonable:

import re
from collections import namedtuple

Format = namedtuple('Format', 'name format_string regex')
class Parser(object):
    replacements = [Format('server', '%S', r'[A-Za-z0-9]+'),
                    Format('user', '%U', r'[A-Za-z0-9]+'),
                    Format('date', '%D', r'[0-9]{4}-[0-9]{2}-[0-9]{2}'),
                    Format('process_id', '%P', r'[0-9]+'),
                    ]

    def __init__(self, format):
        self.format = format
        self.re = re.compile(self._create_regex(format))

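    def _create_regex(self, format):
        # Escape literal characters so that "(" or "-" match themselves.
        format = re.escape(format)
        for replacement in self.replacements:
            # Escape the token the same way as the format string:
            # re.escape() only escapes "%" on Python versions before 3.7.
            token = re.escape(replacement.format_string)
            format = format.replace(token,
                                    r'(?P<%s>%s)' % (replacement.name,
                                                     replacement.regex,
                                                     ),
                                    )
        return format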

    def parse(self, data):
        match = self.re.match(data)
        if match:
            return match.groupdict()
        return None

Usage:

a_parser = Parser("(%D)%U_%P_%S")
print(a_parser.parse("(2005-04-12)Jamie_123_Server1"))

b_parser = Parser("%S - %P - %D")
print(b_parser.parse("Server1 - 123 - 2005-04-12"))

Output:

{'date': '2005-04-12', 'user': 'Jamie', 'process_id': '123', 'server': 'Server1'}
{'server': 'Server1', 'process_id': '123', 'date': '2005-04-12'}

Essentially, I'm creating a mapping between the %? tokens in your custom format syntax and predefined regular expressions that match each parameter, then replacing each %? token in the given format string with the corresponding regex to build a parser for that pattern.
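The same mapping idea can also be done in a single pass over the format string. This is only a sketch of a variation, not the code above: the TOKENS table and the format_to_regex name are made up for illustration, and it assumes every token is a % followed by one uppercase letter.

```python
import re

# Hypothetical token table, mirroring the replacements list in the answer.
TOKENS = {
    "%S": ("server", r"[A-Za-z0-9]+"),
    "%U": ("user", r"[A-Za-z0-9]+"),
    "%D": ("date", r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),
    "%P": ("process_id", r"[0-9]+"),
}


def format_to_regex(format_string):
    """Translate a format string into regex source text in one pass."""
    # Splitting on a capturing group keeps the tokens in the result, so
    # literal text and tokens alternate: [literal, token, literal, ...].
    parts = re.split(r"(%[A-Z])", format_string)
    pieces = []
    for part in parts:
        if part in TOKENS:
            name, pattern = TOKENS[part]
            pieces.append("(?P<%s>%s)" % (name, pattern))
        else:
            # Literal text (or an unknown token) has to match verbatim.
            pieces.append(re.escape(part))
    return "".join(pieces)


pattern = re.compile(format_to_regex("(%D)%U_%P_%S"))
match = pattern.fullmatch("(2005-04-12)Jamie_123_Server1")
print(match.groupdict())
```

Because only the literal pieces are escaped, this version does not depend on how re.escape() treats the % character. It also uses fullmatch() rather than match(), so trailing text after the last field is rejected instead of silently ignored.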

This only works if the characters that delimit a "type" in the format string can't be matched by its regex or, where there is no delimiter, if the two adjacent regexes don't "interfere" with each other. For example, with the format string:

%U%P

With the regexes I've assigned to user and process_id above, it's impossible to tell where user ends and process_id starts in this string:

User1234

Is that User1 and 234 or User and 1234, or any other combination? But then, even a human can't work that out!
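To see what the regex engine actually does with that input, here is a small sketch using the same two patterns back to back. The engine does not report the ambiguity; it just returns the greedy split. The second pattern assumes usernames never contain digits, which is an assumption for illustration, not something the question states.

```python
import re

# The answer's user and process_id patterns, with no delimiter between them.
ambiguous = re.compile(r"(?P<user>[A-Za-z0-9]+)(?P<process_id>[0-9]+)")
result = ambiguous.fullmatch("User1234").groupdict()
# The greedy user group takes as much as it can and leaves one digit.
print(result["user"], result["process_id"])  # User123 4

# Assumption: usernames are letters only, so the two fields cannot overlap.
strict = re.compile(r"(?P<user>[A-Za-z]+)(?P<process_id>[0-9]+)")
strict_result = strict.fullmatch("User1234").groupdict()
print(strict_result["user"], strict_result["process_id"])  # User 1234
```

So an undelimited format still "parses", but the split is only meaningful when the adjacent patterns cannot match the same characters.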

Upvotes: 2
