icomefromchaos

Reputation: 235

Regular expression to parse log.

I'm trying to write a regular expression to parse out an old IRC log that I have.

Regular Expression:

  (\d\d:\d\d)(<)(@|\+)(.+?)>(.*)

LOG Example:

= 00:00<@billy> text text text text text text text text text text text text text text text 
= 00:03<+tom> text text text text text text 
= 00:03<somedude> text text

I've been able to parse out everything that I need from the log, except for users who have neither operator (@) nor voice (+) status in the channel.

Thus, when I run the regex I get the following:

[('00:00', '<', '@', 'billy', " text text text text text text text text text text text text text text text ")]
[('00:03', '<', '+', 'tom', " text text text text text text ")]
[]

Hence, 'somedude' is missing. Would anyone have any hints on how to better approach this?

Upvotes: 1

Views: 788

Answers (1)

Wiktor Stribiżew

Reputation: 626690

The main point is to make the @ or + optional, either by adding ? after (@|\+) or, better, by replacing that group with an optional character class: [@+]?. Note that + does not need to be escaped inside a character class, where it always matches a literal plus sign.
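As a minimal sketch of that smallest possible fix, here is the pattern from the question with only the status group changed. The `log_lines` and `results` names are illustrative, and the sample lines are shortened versions of the ones in the question:

```python
import re

# Sample lines shortened from the log in the question.
log_lines = [
    '= 00:00<@billy> text text text',
    '= 00:03<+tom> text text',
    '= 00:03<somedude> text text',
]

# The question's pattern with the status prefix made optional:
# (@|\+) becomes ([@+]?), so the group may match an empty string.
pattern = re.compile(r'(\d\d:\d\d)(<)([@+]?)(.+?)>(.*)')

results = [pattern.findall(line) for line in log_lines]
for r in results:
    print(r)
```

With this change the third group is an empty string for users without operator or voice status, so 'somedude' is no longer dropped.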

In Python 3, I suggest using the regex with named capturing groups.

import re
ss = [ '= 00:00<@billy> text text text text text text text text text text text text text text text ',
'= 00:03<+tom> text text text text text text ',
'= 00:03<somedude> text text']
for s in ss:
    m = re.search(r'(?P<time>\d{2}:\d{2})<(?P<user>[@+]?[^>]*)>(?P<message>.*)', s)
    if m:
        print(m.groupdict())

Output:

{'time': '00:00', 'message': ' text text text text text text text text text text text text text text text ', 'user': '@billy'}
{'time': '00:03', 'message': ' text text text text text text ', 'user': '+tom'}
{'time': '00:03', 'message': ' text text', 'user': 'somedude'}

Pattern details

  • (?P<time>\d{2}:\d{2}) - Group "time": 2 digits, :, 2 digits
  • < - a <
  • (?P<user>[@+]?[^>]*) - Group "user": 1 or 0 @ or +, and then any 0+ chars other than >
  • > - a >
  • (?P<message>.*) - Group "message": any 0+ chars, up to the end of the line
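The breakdown above can also be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch: the `LINE_RE` constant and the `parse_line` helper are hypothetical names, not part of the original answer.

```python
import re

# Same pattern as above, written in verbose mode so that each part
# carries its own comment.
LINE_RE = re.compile(r'''
    (?P<time>\d{2}:\d{2})     # group "time": 2 digits, a colon, 2 digits
    <                         # a literal <
    (?P<user>[@+]?[^>]*)      # group "user": optional @ or +, then any chars other than >
    >                         # a literal >
    (?P<message>.*)           # group "message": the rest of the line
''', re.VERBOSE)


def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    m = LINE_RE.search(line)
    return m.groupdict() if m else None


parsed = parse_line('= 00:03<somedude> text text')
print(parsed)
```

Returning None for non-matching lines lets the caller skip anything in the log that is not a chat message, for example with `if parsed is not None:`.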

Upvotes: 1
