TommyVercetti

Reputation: 11

Python Regex - Capturing Groups of Repeating Patterns

I have a log file that I am trying to parse. An example line from the file is below:

Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT #server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 #msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)

I want to pull out every piece of text that starts with a hash and has a key and a value, for example #msgtype=EVENT. A hash that is not followed by a key and an "=" sign should be treated as part of the value.

So for the above log entry, I want a list that looks like this:

#msgtype=EVENT
#server=Web/Dev@server1web
#func=LKZ_WriteData ( line 2992 ) 
#rc=0
#msgid=XYZ0064 
#reqid=0
#msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0) (Notice the hash present in the middle of the text)

I have tried Python's re.findall, but I am not able to capture all of the data.

For example:

import re

# Named "text" rather than "str" so the built-in str type is not shadowed
text = 'Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT #server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 #msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)'

z = re.findall("(#.+?=.+?)(:?#|$)", text)
print(z)

Output:

[('#msgtype=EVENT ', '#'), ('#func=LKZ_WriteData ( line 2992 ) ', '#'), ('#msgid=XYZ0064 ', '#'), ('#msg=Web Activity end (section 200, ', '#')]

Upvotes: 1

Views: 74

Answers (2)

seymourgoestohollywood

Reputation: 1167

import re

s = "Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: #msgtype=EVENT #server=Web/Dev@server1web #func=LKZ_WriteData ( line 2992 ) #rc=0 #msgid=XYZ0064 #reqid=0 #msg=Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)"

a = re.findall('#(?=[a-zA-Z]+=).+?=.*?(?= #[a-zA-Z]+=|$)', s)

# Split on the first "=" only, so a value that itself contains "=" stays in one piece
result = [item.split('=', 1) for item in a]

print(result)

Gives:

[['#msgtype', 'EVENT'], ['#server', 'Web/Dev@server1web'], ['#func', 'LKZ_WriteData ( line 2992 )'], ['#rc', '0'], ['#msgid', 'XYZ0064'], ['#reqid', '0'], ['#msg', 'Web Activity end (section 200, # SysD 1, Files 222, Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)']]

Upvotes: 0

Wiktor Stribiżew

Reputation: 626738

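The (:?#|$) is a capturing group that matches an optional : and then #, or the end of the string (it was probably meant to be the non-capturing group (?:#|$)). Since re.findall returns all captured substrings when the pattern contains more than one capturing group, the result is a list of tuples. The group also consumes the # that starts the next key, so the next match can only begin after it and every other key=value pair is skipped. Finally, any # ends the lazy .+?, which is why the #msg value is cut off at "# SysD".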
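To make the consuming behaviour concrete, here is a minimal sketch on a shortened, made-up string (the sample text and variable names are illustrative only, not from the question). It runs the pattern from the question and then the same pattern with the group written as non-capturing, which suggests that fixing the typo alone would not be enough:

```python
import re

# Hypothetical short input standing in for the real log line
sample = "#a=1 #b=2 #c=3"

# Pattern from the question: two capturing groups, so findall returns tuples,
# and the second group consumes the "#" that starts the next pair.
with_typo = re.findall(r"(#.+?=.+?)(:?#|$)", sample)
print(with_typo)        # [('#a=1 ', '#'), ('#c=3', '')]

# Same pattern with a non-capturing group: findall now returns plain strings,
# but the "#" is still consumed, so "#b=2" is still skipped.
non_capturing = re.findall(r"(#.+?=.+?)(?:#|$)", sample)
print(non_capturing)    # ['#a=1 ', '#c=3']
```

In both runs the pair in the middle is lost, because the character that would start it has already been used up by the previous match.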

You need

re.findall(r'#[^\s=]+=.*?(?=\s*#[^\s=]+=|$)', text)


Regex details

  • #[^\s=]+ - # and then any 1+ chars other than whitespace and =
  • = - a = char
  • .*? - any 0+ chars other than line break chars, as few as possible
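  • (?=\s*#[^\s=]+=|$) - a positive lookahead that requires, immediately to the right, either 0+ whitespace chars, #, 1+ chars other than whitespace and =, and then =, or the end of the string. The lookahead text is not consumed, so it is excluded from the match.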
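As a rough end-to-end sketch, the pattern above can be applied to the log line from the question and the matches turned into a dictionary. The names PAIR_RE, pairs and fields are illustrative choices, and building a dict is an assumption about what is wanted downstream, not part of the original answer:

```python
import re

# Log line from the question, split across literals only for readability
line = (
    "Oct 23 13:03:03.714012 prod1_xyz(RSVV)[201]: "
    "#msgtype=EVENT #server=Web/Dev@server1web "
    "#func=LKZ_WriteData ( line 2992 ) #rc=0 #msgid=XYZ0064 #reqid=0 "
    "#msg=Web Activity end (section 200, # SysD 1, Files 222, "
    "Bytes 343422089928, Errors 0, Aborted Files 0, Busy Files 0)"
)

# "#key=" followed by as little as possible, up to the next "#key=" or the end
PAIR_RE = re.compile(r"#[^\s=]+=.*?(?=\s*#[^\s=]+=|$)")

pairs = PAIR_RE.findall(line)

# Split each match on the first "=" only, so "=" inside a value is preserved
fields = dict(pair.split("=", 1) for pair in pairs)

for key, value in fields.items():
    print(key, "->", value)
```

The lone "# " inside the #msg value does not end the match, because the lookahead needs at least one non-whitespace, non-= character between the # and the =. For a whole log file, one option is to apply the pattern to each line in turn.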

Upvotes: 1
