Reputation: 1177
I am trying to fetch all the text between two consecutive timestamps, and to do this for every timestamp in the file. Example:
2022/May/31 14:42:33.775887 blabla.
2022/May/31 14:42:33.775907 abc, id 2
2022/May/31 14:42:33.781586 blabla {
one
huge
block
of
text
}
2022/May/31 14:42:33.781982 def
As you can see, it is a log file with many timestamps; some entries are short single lines and others have more text before the next timestamp.
The goal is to fetch the text between each pair of timestamps and then modify, analyze, or discard it as needed.
I have one approach, but it does not do what I want and it causes problems when I analyze the text.
import re

with open(infile) as fp:
    for result in re.findall(r"\d{4}/\w{3}/\d{2}(.*?)\}", fp.read(), re.S):
        print(result)
The above works when the timestamp is followed by a block of text {...}, but when it is not, the match runs on until it finds the next '}', which belongs to a different timestamp and is wrong.
So, when I run this on the example above, it gives me:
2022/May/31 14:42:33.775887 blabla.
2022/May/31 14:42:33.775907 abc, id 2
2022/May/31 14:42:33.781586 blabla {
one
huge
block
of
text
}
I also tried this, but it doesn't give me anything:
with open(infile) as fp:
    for result in re.findall(r"^\d{4}/\w{3}/\d{2}(.*?)\d{4}/\w{3}/\d{2}", fp.read(), re.S):
        print(result)
Any clue what I am doing wrong?
The approach from 'The fourth bird' works beautifully:
# One capture group that starts at the time part, so findall returns plain strings
pattern = r"^\d{4}/\w{3}/\d{2} (\d\d:\d\d:\d\d\.\d+ .*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)"
with open(infile) as fp:
    print(re.findall(pattern, fp.read(), re.M))
This gives me (with the time part kept):
['14:42:33.775887 blabla.', '14:42:33.775907 abc, id 2', '14:42:33.781586 blabla {\none\nhuge\nblock\nof\ntext\n}', '14:42:33.781982 def']
Upvotes: 2
Views: 240
Reputation: 163467
Without using re.S, you can capture the text between the timestamps by checking the start of each line: a line continues the current entry as long as it does not begin with a date.
If you also don't want to include the time part, you can match it before the capture group starts.
With a single capture group, re.findall returns the group 1 values.
^\d{4}/\w{3}/\d{2} \d\d:\d\d:\d\d\.\d+ (.*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)
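To make the parts of that pattern easier to follow, here is a sketch of the same expression written with re.VERBOSE so each piece can carry a comment. The layout and the shortened `sample` string are only illustrative; the pattern itself is unchanged apart from writing the literal spaces as `[ ]`, which verbose mode requires.

```python
import re

# Same pattern, spelled out with re.VERBOSE (re.X).
# Verbose mode ignores plain whitespace, so literal spaces are written as "[ ]".
pattern = re.compile(r"""
    ^\d{4}/\w{3}/\d{2}[ ]          # date at the start of a line, e.g. 2022/May/31
    \d\d:\d\d:\d\d\.\d+[ ]         # time with fractional seconds (matched, not captured)
    (                              # group 1: the text belonging to this timestamp
      .*                           #   rest of the first line
      (?:                          #   then any number of continuation lines:
        \n(?!\d{4}/\w{3}/\d{2}\b)  #     a newline NOT followed by another date
        .*                         #     and the rest of that line
      )*
    )
""", re.M | re.X)

# Shortened, illustrative sample in the same format as the log in the question
sample = (
    "2022/May/31 14:42:33.775907 abc, id 2\n"
    "2022/May/31 14:42:33.781586 blabla {\n"
    "one\n"
    "}\n"
    "2022/May/31 14:42:33.781982 def"
)
blocks = pattern.findall(sample)
print(blocks)
```

The negative lookahead is what stops an entry: as soon as a newline is followed by another date, the repetition ends and the next match starts there.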
import re
pattern = r"^\d{4}/\w{3}/\d{2} \d\d:\d\d:\d\d\.\d+ (.*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)"
s = ("2022/May/31 14:42:33.775887 blabla.\n"
     "2022/May/31 14:42:33.775907 abc, id 2\n"
     "2022/May/31 14:42:33.781586 blabla {\n"
     "one\n"
     "huge\n"
     "block\n"
     "of\n"
     "text\n"
     "}\n"
     "2022/May/31 14:42:33.781982 def")
print(re.findall(pattern, s, re.M))
Output
['blabla.', 'abc, id 2', 'blabla {\none\nhuge\nblock\nof\ntext\n}', 'def']
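If each entry should then be analyzed, modified, or discarded, one possible way is to iterate with re.finditer and named groups instead of collecting everything with re.findall. This is only a sketch: the names `ENTRY` and `iter_entries`, the group names `time` and `text`, and the rule "keep only multi-line entries" are assumptions made for illustration.

```python
import re

# Hypothetical helper built on the same pattern; the group names are illustrative.
ENTRY = re.compile(
    r"^\d{4}/\w{3}/\d{2} (?P<time>\d\d:\d\d:\d\d\.\d+) "
    r"(?P<text>.*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)",
    re.M,
)

def iter_entries(log):
    """Yield (time, text) for every timestamped entry in the log string."""
    for m in ENTRY.finditer(log):
        yield m.group("time"), m.group("text")

log = ("2022/May/31 14:42:33.775887 blabla.\n"
       "2022/May/31 14:42:33.775907 abc, id 2\n"
       "2022/May/31 14:42:33.781586 blabla {\n"
       "one\n"
       "huge\n"
       "block\n"
       "of\n"
       "text\n"
       "}\n"
       "2022/May/31 14:42:33.781982 def")

# Example rule (an assumption): keep the multi-line entries, discard the rest.
kept = [(time, text) for time, text in iter_entries(log) if "\n" in text]
print(kept)
```

Because `iter_entries` is a generator, the decision to keep, change, or drop an entry can be made one entry at a time without building the full list first.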
Upvotes: 1