FotisK

Reputation: 1177

How to fetch all text between two timestamps with Python, multiple times, keeping the timestamp too

I am trying to fetch all the text between two timestamps, and to do this repeatedly across a whole file. Example:

2022/May/31 14:42:33.775887 blabla.
2022/May/31 14:42:33.775907 abc, id 2
2022/May/31 14:42:33.781586 blabla {
one
huge
block
of
text
}
2022/May/31 14:42:33.781982 def

As you can see, it is a log file with many timestamps. Some entries are short single lines and others have more text before the next timestamp.

The goal is to fetch the text between each pair of consecutive timestamps, then modify or analyze it if needed, or discard it.

I have one approach, but it is not what I want, and it causes problems when I analyze the text.

import re

with open(infile) as fp:
    for result in re.findall(r"\d{4}/\w{3}/\d{2}(.*?)\}", fp.read(), re.S):
        print(result)

This works when the timestamp is followed by a block of text {...}, but when an entry has no block, the match continues until it finds the next '}', which is wrong because that brace belongs to a later timestamp.

So when I run this on the example above, the first match covers three entries at once:

2022/May/31 14:42:33.775887 blabla.
2022/May/31 14:42:33.775907 abc, id 2
2022/May/31 14:42:33.781586 blabla {
one
huge
block
of
text
}

I also tried this, but it doesn't give me anything:

with open(infile) as fp:
    for result in re.findall(r"^\d{4}/\w{3}/\d{2}(.*?)\d{4}/\w{3}/\d{2}", fp.read(), re.S):
        print(result)

Any clue what I am doing wrong?

The approach from 'The fourth bird' works beautifully:

import re

# A single capture group holds the time plus the text, so re.findall returns plain strings.
pattern = r"^\d{4}/\w{3}/\d{2} (\d\d:\d\d:\d\d\.\d+ .*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)"
with open(infile) as fp:
    print(re.findall(pattern, fp.read(), re.M))

This gives me the following (with the time part kept):

['14:42:33.775887 blabla.', '14:42:33.775907 abc, id 2', '14:42:33.781586 blabla {\none\nhuge\nblock\nof\ntext\n}', '14:42:33.781982 def']
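
A side note on the shape of the result: what re.findall returns depends on how many capture groups the pattern has. With one group it returns a flat list of strings; with two or more it returns a list of tuples. The sketch below illustrates this on two lines from the example. The names sample, one_group and two_groups are made up for the illustration.

```python
import re

# Two lines in the same shape as the log example above.
sample = (
    "2022/May/31 14:42:33.775887 blabla.\n"
    "2022/May/31 14:42:33.775907 abc, id 2"
)

# Hypothetical patterns, used only to show the difference in return shape.
one_group = r"^\d{4}/\w{3}/\d{2} (\d\d:\d\d:\d\d\.\d+ .*)"
two_groups = r"^\d{4}/\w{3}/\d{2} (\d\d:\d\d:\d\d\.\d+) (.*)"

# One capture group: a flat list of strings.
flat = re.findall(one_group, sample, re.M)

# Two capture groups: a list of (time, text) tuples.
pairs = re.findall(two_groups, sample, re.M)

print(flat)
print(pairs)
```

So if a flat list of "time text" strings is wanted, the time and the text should sit inside the same group.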

Upvotes: 2

Views: 240

Answers (1)

The fourth bird

Reputation: 163467

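Without using re.S, you can capture the text between the timestamps by checking how each line starts: with re.M, ^ matches at the start of every line, and a negative lookahead lets the match keep consuming following lines as long as they do not begin with a date.

If you also don't want to include the time part, you can match it before the capture group starts.

Because the pattern has a single capture group, re.findall returns a list of the group 1 values.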

^\d{4}/\w{3}/\d{2} \d\d:\d\d:\d\d\.\d+ (.*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)
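
To make the parts of the pattern easier to follow, here is the same regex written with re.X (verbose mode) and one comment per piece. This is only a sketch of how the pattern reads; in verbose mode literal spaces have to be written as [ ], and the names pattern, sample and entries are chosen for the illustration.

```python
import re

pattern = re.compile(
    r"""
    ^\d{4}/\w{3}/\d{2}[ ]          # date at the start of a line, e.g. 2022/May/31
    \d\d:\d\d:\d\d\.\d+[ ]         # time with fractional seconds (matched, not captured)
    (                              # group 1: the text of this entry
      .*                           #   the rest of the timestamped line
      (?:                          #   then any number of continuation lines:
        \n(?!\d{4}/\w{3}/\d{2}\b)  #     a newline that is NOT followed by a date
        .*                         #     and the rest of that line
      )*
    )
    """,
    re.M | re.X,
)

sample = (
    "2022/May/31 14:42:33.775887 blabla.\n"
    "2022/May/31 14:42:33.775907 abc, id 2\n"
    "2022/May/31 14:42:33.781586 blabla {\n"
    "one\n"
    "huge\n"
    "block\n"
    "of\n"
    "text\n"
    "}\n"
    "2022/May/31 14:42:33.781982 def"
)

entries = pattern.findall(sample)
print(entries)
```

Since . does not match a newline without re.S, each .* stops at the end of its line, and the lookahead decides whether the next line still belongs to the current entry.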


import re

pattern = r"^\d{4}/\w{3}/\d{2} \d\d:\d\d:\d\d\.\d+ (.*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)"

s = (
    "2022/May/31 14:42:33.775887 blabla.\n"
    "2022/May/31 14:42:33.775907 abc, id 2\n"
    "2022/May/31 14:42:33.781586 blabla {\n"
    "one\n"
    "huge\n"
    "block\n"
    "of\n"
    "text\n"
    "}\n"
    "2022/May/31 14:42:33.781982 def"
)

print(re.findall(pattern, s, re.M))

Output

['blabla.', 'abc, id 2', 'blabla {\none\nhuge\nblock\nof\ntext\n}', 'def']
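
If the full timestamp should be kept next to each text so that entries can be analyzed or discarded one by one, a possible variation is re.finditer with named groups. This is a minimal sketch, not part of the pattern above: ENTRY, iter_entries and the group names stamp and text are hypothetical names, and the filter at the end is just one example of a discard rule.

```python
import re
from typing import Iterator, Tuple

# Same idea as the pattern above, but the date and time are captured as well.
ENTRY = re.compile(
    r"^(?P<stamp>\d{4}/\w{3}/\d{2} \d\d:\d\d:\d\d\.\d+) "
    r"(?P<text>.*(?:\n(?!\d{4}/\w{3}/\d{2}\b).*)*)",
    re.M,
)


def iter_entries(log_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, text) for every timestamped entry in log_text."""
    for match in ENTRY.finditer(log_text):
        yield match.group("stamp"), match.group("text")


sample = (
    "2022/May/31 14:42:33.775887 blabla.\n"
    "2022/May/31 14:42:33.775907 abc, id 2\n"
    "2022/May/31 14:42:33.781586 blabla {\n"
    "one\n"
    "huge\n"
    "block\n"
    "of\n"
    "text\n"
    "}\n"
    "2022/May/31 14:42:33.781982 def"
)

all_entries = list(iter_entries(sample))

# Example rule: keep only entries whose text spans several lines, discard the rest.
multi_line = [(stamp, text) for stamp, text in all_entries if "\n" in text]

for stamp, text in multi_line:
    print(stamp, repr(text))
```

For a real file, the text would come from fp.read() as in the question; the generator then lets each entry be inspected without building intermediate lists of groups.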

Upvotes: 1
