Inbar Rose

Reputation: 43447

Very specific substring retrieval and split

I know there are tons of posts about substrings; believe me, I have searched through many of them looking for an answer to this.

I have many strings, lines from a log, and I am trying to categorize and parse them.

They look something like this:

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

Here the filename is the file the log entry came from, the date is the date/time at which the message was written to the log, and TYPE is the type of message. The message itself has two parts: a static part, which is always the same for a given message, and a dynamic part, which can change. The two parts are separated by a ;, but the dynamic part can contain more ; characters.

I want to be able to extract the Static Message and the Dynamic Message.

So far I have been using something like this:

parts = line.split(";")
static = parts[0]
dynamic = ";".join(parts[1:])

Not very pretty. Also, my static part still contains the filename, the date, and the type, which I do not want. So then I thought I would do something like this:

parts = " ".join(":".join(line.split(":")[1:]).split(" ")[4:]).split(";")
static = parts[0]
dynamic = ";".join(parts[1:])

I have tried this, and it works to some extent, except that sometimes the filename might contain a space, or the TYPE might contain a space, or something else isn't working properly, and I sometimes get the TYPE as part of the static message. Efficiency is an issue, since there are thousands of log lines that must be parsed and categorized daily. So I am wondering whether there is a better way to do this than this hack job.

Edit: I thought I would provide more examples of lines in the log. To correct what I said earlier, there are a few types of entries.

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message

So as you can see, there are two types of log entries: those without modules and those with. Entries with modules can be tied either to connections or to threads. This makes the parsing harder.

Upvotes: 1

Views: 113

Answers (2)

Pierre GM

Reputation: 20339

You could try something like:

>>> import re
>>> regexp = re.compile(r"^([/.\w]*):(\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s([A-Z]*)\s([\w\s]*);([\w\s]*)$")
>>> regexp.match(line).groups()
('/long/file/name/with.dots.and.extension', 'Jan 01 12:00:00', 'TYPE', 'Static Message', 'Dynamic Message')
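Since the question says the dynamic part may itself contain more ; characters, a pattern whose last group only accepts word characters and whitespace would not match such lines. A possible variant, offered only as a sketch: the static group stops at the first ; and the final group takes everything after it. The sample line with extra semicolons is made up for illustration.

```python
import re

# Variant of the pattern above: the static part is "anything up to the
# first ';'" and the dynamic part is "everything after it".
regexp = re.compile(
    r"^([/.\w]*):(\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s([A-Z]*)\s([^;]*);(.*)$"
)

# Illustrative line (not from a real log) with an extra ';' in the dynamic part.
line = ("/long/file/name/with.dots.and.extension:"
        "Jan 01 12:00:00 TYPE Static Message;Dynamic; Message")

groups = regexp.match(line).groups()
print(groups[3])  # Static Message
print(groups[4])  # Dynamic; Message
```

This still assumes a filename made only of slashes, dots, and word characters, and a single upper-case TYPE token, so it would not cover the MODULE.NAME entries from the edit.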

Upvotes: 0

Martijn Pieters

Reputation: 1122222

You can limit the split to the first ';' only:

static, dynamic = line.split(';', 1)

Your static part splitting might take a little more doing, but if you know the number of spaces is going to be static in the first part, perhaps the same trick could work there:

static = static.split(' ', 4)[-1]
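Put together on the sample line from the question, the two limited splits could look like the sketch below. The extra semicolons in the dynamic part are added for illustration, and the name head is just a local helper introduced here.

```python
# Sample line from the question, with extra ';' added to the dynamic part.
line = ("/long/file/name/with.dots.and.extension:"
        "Jan 01 12:00:00 TYPE Static Message;Dynamic; Message; parts")

# Split once on the first ';' so later semicolons stay in the dynamic part.
head, dynamic = line.split(";", 1)

# Split on at most 4 spaces: 'file:Jan', '01', '12:00:00', 'TYPE', and the rest.
static = head.split(" ", 4)[-1]

print(static)   # Static Message
print(dynamic)  # Dynamic; Message; parts
```

Note that the tuple unpacking raises ValueError for a line without any ';', so str.partition(';') may be a safer choice if such lines can occur.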

If the first part of the line is more complex (spaces in the TYPE part) I fear that removing everything before that is going to be a more difficult affair. Your best bet is to figure out the limited set of values TYPE could assume and to use a regular expression with that information to split the static part.
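A rough sketch of that idea follows. The TYPE values (ERROR, WARNING, INFO) are placeholders invented for this example, as are the names KNOWN_TYPES, LINE_RE and parse_line. It also assumes that the module name and the thread/connection information are each a single token without spaces, and that the latter appears only when a module is present, as in the second example from the question.

```python
import re

# Hypothetical TYPE values; replace with the set that really occurs in the logs.
KNOWN_TYPES = ("ERROR", "WARNING", "INFO")

LINE_RE = re.compile(
    r"^(?P<filename>.*?):"                          # lazy, so spaces are allowed
    r"(?P<date>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "     # e.g. Jan 01 12:00:00
    r"(?:(?P<module>\S+) )?"                        # optional MODULE.NAME
    r"(?P<type>" + "|".join(re.escape(t) for t in KNOWN_TYPES) + r") "
    r"(?P<rest>[^;]*);"                             # everything up to the first ';'
    r"(?P<dynamic>.*)$"                             # may contain more ';'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    rest = fields.pop("rest")
    if fields["module"] is not None:
        # Assumption: module entries carry one thread/connection token first.
        info, _, static = rest.partition(" ")
    else:
        info, static = None, rest
    fields["info"] = info
    fields["static"] = static
    return fields

plain = parse_line("/long/file/name/with.dots.and.extension:"
                   "Jan 01 12:00:00 ERROR Static Message;Dynamic; Message")
print(plain["static"], "|", plain["dynamic"])
```

Because the pattern is compiled once and applied with a single match per line, it should cope with a few thousand lines a day, though that is an expectation rather than something measured here.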

Upvotes: 1
