Reputation: 629
I have a Python program that takes files from many sources. All files from the same source have the same format, but the formats vary greatly between sources. One source might use the format ServerName - ProcessID - Date; another might use (Date)_Username_ProcessID_Server. Currently, adding a new source with a new format requires a coder to write a parse function for that source.
I've started writing a new adapter, and I'd like to store the file format as a string. The first one would be something like %S - %P - %D, and the second something like (%D)_%U_%P_%S.
What would be the best approach for this in Python 3?
Upvotes: 1
Views: 48
Reputation: 7555
Something like this would be reasonable:
import re
from collections import namedtuple

Format = namedtuple('Format', 'name format_string regex')


class Parser(object):
    # Maps each %-token in the format syntax to a group name and a regex.
    replacements = [Format('server', '%S', r'[A-Za-z0-9]+'),
                    Format('user', '%U', r'[A-Za-z0-9]+'),
                    Format('date', '%D', r'[0-9]{4}-[0-9]{2}-[0-9]{2}'),
                    Format('process_id', '%P', r'[0-9]+'),
                    ]

    def __init__(self, format):
        self.format = format
        self.re = re.compile(self._create_regex(format))

    def _create_regex(self, format):
        # Escape the literal characters, then swap each token for a named group.
        format = re.escape(format)
        for replacement in self.replacements:
            # Escape the token too: re.escape() stopped escaping '%' in
            # Python 3.7, so a hard-coded '\%S' would no longer be found.
            format = format.replace(re.escape(replacement.format_string),
                                    r'(?P<%s>%s)' % (replacement.name,
                                                     replacement.regex,
                                                     ),
                                    )
        return format

    def parse(self, data):
        # Returns a dict of the named fields, or None if the data doesn't match.
        match = self.re.match(data)
        if match:
            return match.groupdict()
        return None
Usage:
a_parser = Parser("(%D)%U_%P_%S")
print(a_parser.parse("(2005-04-12)Jamie_123_Server1"))
b_parser = Parser("%S - %P - %D")
print(b_parser.parse("Server1 - 123 - 2005-04-12"))
Output:
{'date': '2005-04-12', 'user': 'Jamie', 'process_id': '123', 'server': 'Server1'}
{'server': 'Server1', 'process_id': '123', 'date': '2005-04-12'}
Essentially, I'm creating a mapping between the %? tokens in your custom format syntax and a predefined regular expression to match that parameter, then replacing the %? strings in the given format string with the corresponding regex to build a parser for that pattern.
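To make that substitution step concrete, here is a minimal standalone sketch of it. The names TOKENS and format_to_regex are made up for this illustration, and the table just mirrors the four entries used above.

```python
import re

# Hypothetical token table mirroring the entries used in the answer.
TOKENS = {
    "%S": ("server", r"[A-Za-z0-9]+"),
    "%U": ("user", r"[A-Za-z0-9]+"),
    "%D": ("date", r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),
    "%P": ("process_id", r"[0-9]+"),
}


def format_to_regex(format_string):
    """Turn a format string such as '%S - %P - %D' into a regex pattern."""
    # Escape everything first so literal characters like '(' match themselves.
    pattern = re.escape(format_string)
    for token, (name, token_regex) in TOKENS.items():
        # Escape the token the same way the format string was escaped, so the
        # lookup works whether or not this Python's re.escape() escapes '%'.
        pattern = pattern.replace(re.escape(token),
                                  "(?P<%s>%s)" % (name, token_regex))
    return pattern


pattern = format_to_regex("%S - %P - %D")
print(pattern)  # exact escaping of spaces and dashes varies by Python version
print(re.fullmatch(pattern, "Server1 - 123 - 2005-04-12").groupdict())
```

Printing the generated pattern for a new format string is a quick way to check it before pointing it at real files.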
This will only work if the characters that delimit a "type" in the format string don't appear in its regex or, when there is no delimiter, if the two adjacent regexes don't "interfere" with each other. For example, with the format string:
%U%P
and the regexes I've assigned to user and process_id above, it's impossible to tell where user ends and process_id starts in this string:
User1234
Is that User1 and 234, or User and 1234, or any other combination? But then, even a human can't work that out!
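A small sketch of that ambiguity, assuming the same two character classes as above (the helper possible_splits is made up for this illustration). It lists every split both regexes would accept, then shows that the regex engine doesn't complain about the ambiguity: greedy matching just quietly picks one of them.

```python
import re

USER = r"[A-Za-z0-9]+"   # same character class as the 'user' entry above
PROCESS_ID = r"[0-9]+"   # same character class as the 'process_id' entry above


def possible_splits(text):
    """Return every (user, process_id) split of text that both regexes accept."""
    splits = []
    for i in range(1, len(text)):
        user, process_id = text[:i], text[i:]
        if re.fullmatch(USER, user) and re.fullmatch(PROCESS_ID, process_id):
            splits.append((user, process_id))
    return splits


print(possible_splits("User1234"))
# [('User', '1234'), ('User1', '234'), ('User12', '34'), ('User123', '4')]

# The combined pattern still matches; the greedy user group takes as much as
# it can and leaves a single digit for process_id.
match = re.fullmatch("(?P<user>%s)(?P<process_id>%s)" % (USER, PROCESS_ID),
                     "User1234")
print(match.groupdict())
# {'user': 'User123', 'process_id': '4'}
```

So if a source really does run two fields together, either the format needs a delimiter or one of the regexes has to be tightened so they can't overlap.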
Upvotes: 2