Reputation: 687
My goal is to create a text parser for files containing multi-line data like this:
Applying option loglevel (set logging level) with argument debug.
Successfully parsed a group of options.
Parsing a group of options: input url http://prod7.team.cn/test/tracks-v1a1/mono.
Successfully parsed a group of options.
Opening an input file: http://prod7.team.cn/test/tracks-v1a1/mono
[NULL @ 000001e002039000] Opening 'http://prod7.team.cn/test/tracks-v1a1/mono' for reading
[http @ 000001e00203a040] Setting default whitelist 'http,https,tls,rtp,tcp,udp,crypto,httpproxy'
[tcp @ 000001e00203ba80] Original list of addresses:
[tcp @ 000001e00203ba80] Address 92.223.97.22 port 80
[tcp @ 000001e00203ba80] Interleaved list of addresses:
[tcp @ 000001e00203ba80] Address 92.223.97.22 port 80
[tcp @ 000001e00203ba80] Starting connection attempt to 92.223.97.22 port 80
[tcp @ 000001e00203ba80] Successfully connected to 92.223.97.22 port 80
[http @ 000001e00203a040] request: GET /test/tracks-v1a1/mono HTTP/1.1
User-Agent: Lavf/58.31.101
Accept: */*
Range: bytes=0-
Connection: close
Host: prod7.team.cn
Icy-MetaData: 1
Each file contains multiple sets of such information. My goal is to find the IP address on every "Successfully connected" line, followed by the Host value up to the end of that line (LF).
In the example above, a valid match would be IP 92.223.97.22, HOST prod7.team.cn.
I can easily find the IP using a regex, but I don't understand how to build a single match that skips the intervening lines up to "Host".
If I use this Regex
(connected to).([0-9].(?:\.[0-9]+){3}.port.*.*)
I find:
Match 1
Full match connected to 92.223.97.22 port 80
Group 1. connected to
Group 2. 92.223.97.22 port 80
I get an error if I add .* or .host.* at the end. I'm confused about how to add another pattern that detects 'Host:' and matches up to the end of that line.
Upvotes: 0
Views: 116
Reputation: 687
I was able to sort it out using nested regexes:
import re
from functools import reduce

ip_list = []
# Lazily match each block from "connected" up to the end of the next "Host" line.
regex = r'connected(.*?)Host[^\n]+$'
with open('C:\\temp\\log.txt', 'r') as f:
    text_as_string = f.read()
ip = re.compile(r'(connected to).[0-9]+(?:\.[0-9]+){3}.port.*')
host = re.compile(r'Host[^\n]+$')
matches = re.finditer(regex, text_as_string, re.DOTALL | re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    block = match.group()
    # Default to 'NA' so both names are defined even if nothing matches.
    f_id = 'NA'
    f_host = 'NA'
    # connected IP
    for m in ip.finditer(block):
        f_id = m.group()
    # connected host
    for m in host.finditer(block):
        f_host = m.group()
    ip_list.append([f_id, f_host])
# Drop duplicate [ip, host] pairs while preserving order.
unique_ip = reduce(lambda l, x: l if x in l else l + [x], ip_list, [])
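As a side note on the last line: the reduce call rescans the accumulated list for every item. A possible alternative, shown here only as a sketch, is to go through dict.fromkeys, which keeps insertion order on Python 3.7+. The sample ip_list below and the helper name unique_pairs are hypothetical and only illustrate the shape of the data.

```python
# Hypothetical sample data in the same [ip_match, host_match] shape as ip_list.
ip_list = [
    ["connected to 92.223.97.22 port 80", "Host: prod7.team.cn"],
    ["connected to 10.0.0.1 port 80", "NA"],
    ["connected to 92.223.97.22 port 80", "Host: prod7.team.cn"],
]


def unique_pairs(pairs):
    """Return the pairs without duplicates, keeping first-seen order."""
    # Lists are not hashable, so each pair is converted to a tuple first.
    # dict keys preserve insertion order (guaranteed from Python 3.7).
    return [list(p) for p in dict.fromkeys(tuple(p) for p in pairs)]


unique_ip = unique_pairs(ip_list)
print(unique_ip)
```

The result has the same content as the reduce version, as long as every element of each pair is hashable (strings are).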
Upvotes: 0
Reputation: 1403
https://docs.python.org/3.7/library/re.html#re.MULTILINE
You want to run your regex with re.DOTALL so that . also matches line breaks (re.MULTILINE, linked above, only makes ^ and $ match at the start and end of each line). Then you could use something like .* to capture the in-between lines.
One caveat: make sure the .* does not run into the start of a new match. For example, CA.*B would match CAB, CACB and CACAB. So you will most likely want your regex to explicitly prevent the .* from overrunning the beginning of another valid match.
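The caveat above can be sketched as a single pattern. This is only one possible way to do it, under a few assumptions: the log lines look exactly like the sample in the question, "Host:" starts its own line, and the names LOG, PATTERN and find_ip_host_pairs are made up for this example. The gap between the IP and the Host line is matched one character at a time, and a negative lookahead refuses to step over another "Successfully connected to".

```python
import re

# Assumed sample: a trimmed copy of the log from the question.
LOG = """\
[tcp @ 000001e00203ba80] Starting connection attempt to 92.223.97.22 port 80
[tcp @ 000001e00203ba80] Successfully connected to 92.223.97.22 port 80
[http @ 000001e00203a040] request: GET /test/tracks-v1a1/mono HTTP/1.1
User-Agent: Lavf/58.31.101
Accept: */*
Range: bytes=0-
Connection: close
Host: prod7.team.cn
Icy-MetaData: 1
"""

PATTERN = re.compile(
    # Group 1: the IPv4 address on the "Successfully connected" line.
    r"Successfully connected to (\d+(?:\.\d+){3}) port \d+"
    # Gap: any characters (DOTALL lets . match newlines), taken lazily,
    # but never stepping onto the start of another connection line.
    r"(?:(?!Successfully connected to).)*?"
    # Group 2: the host value, up to the end of the Host line
    # (MULTILINE lets ^ match at the start of each line).
    r"^Host: ([^\r\n]+)",
    re.DOTALL | re.MULTILINE,
)


def find_ip_host_pairs(text):
    """Return a list of (ip, host) tuples found in the log text."""
    return PATTERN.findall(text)


print(find_ip_host_pairs(LOG))
```

With this shape, a connection that is never followed by a Host line before the next connection is simply skipped instead of being paired with the wrong host.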
Upvotes: 1