UserA

Reputation: 47

Regex findall output not as expected

I am using a regex to extract parts of text that is read from a .txt file. However, my pattern seems to fail on some specific lines.

Below are three lines from the input text:

[2019/07/11 18:52:25.391] Receive : <- AI (Req No. 711185105702666 ) Message from : cop10

[2019/07/11 18:52:25.391] Note    : Response that is not being sent ... cop10

[2019/07/11 18:52:25.393] ★Err    : subargs[0] : IBSDK_7776

Below is the code that extracts a portion of the text after the timestamp:

import re

regex = r"\[.{23}] ?(.{1,8}:.{1,12}).*\n"
pattern = re.compile(regex)
# input_text is the sequence of lines read from the .txt file
for line in input_text:
    matches = pattern.findall(line)
    print('matches is {}'.format(matches))

For lines 1 and 2 in the input text, the output is as expected, i.e. a list of the extracted text.

Shown below is the output for line 1:

matches is ['Receive : <- AI (Req ']

For the last line the list is empty, i.e. [].

My expectation was ['★Err : subargs[0]'] or a list containing some text.

I suspect it could be due to the black star in the text, since those are the lines where the code snippet fails, but I am not sure why it happens.

It would be great to get some input on this, and on whether I need to change my regex.

Upvotes: 2

Views: 73

Answers (1)

The fourth bird

Reputation: 163287

The last line is not matched because there is no newline after it, while the pattern requires a \n at the end.

If you want to keep your current pattern, you could assert the end of the string with $ instead of matching \n.

Your code might look like:

regex = r"\[.{23}] ?(.{1,8}:.{1,12}).*$"
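
To illustrate the difference, here is a minimal sketch that runs both patterns against the last line from the question. It assumes the line arrives without a trailing newline, and the variable names are only illustrative.

```python
import re

# Last log line from the question, assumed to have no trailing newline
# because it is the final line of the file.
last_line = "[2019/07/11 18:52:25.393] ★Err    : subargs[0] : IBSDK_7776"

# Original pattern: requires a literal newline at the end of the match.
newline_pattern = re.compile(r"\[.{23}] ?(.{1,8}:.{1,12}).*\n")
# Adjusted pattern: asserts the end of the string instead.
dollar_pattern = re.compile(r"\[.{23}] ?(.{1,8}:.{1,12}).*$")

no_newline_result = newline_pattern.findall(last_line)
with_newline_result = newline_pattern.findall(last_line + "\n")
dollar_result = dollar_pattern.findall(last_line)

print(no_newline_result)    # empty list: there is no \n to match
print(with_newline_result)  # one match once a newline is appended
print(dollar_result)        # one match without needing a newline
```

Since $ also matches just before a trailing newline at the end of the string, the adjusted pattern should work for the earlier lines as well.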


The current pattern does not take the timestamp format into account: .{23} matches any character except a newline, 23 times, between [ and ].
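
As a small sketch of what that means in practice, the line below is made up for illustration (it is not from the question's log). Its bracketed text happens to be 23 characters long, so the pattern still matches even though it is not a timestamp.

```python
import re

loose_pattern = re.compile(r"\[.{23}] ?(.{1,8}:.{1,12}).*$")

# Hypothetical input: 23 arbitrary characters between the brackets.
fake_prefix = "this is not a timestamp"
fake_line = "[" + fake_prefix + "] Note    : Response that is not being sent"

prefix_length = len(fake_prefix)
loose_result = loose_pattern.findall(fake_line)

print(prefix_length)  # 23
print(loose_result)   # the line is still matched
```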

You could update the pattern to match your current timestamp format (it does not validate the timestamp), use a negated character class [^:]+ followed by : to match up to the first colon, and perhaps omit the .* match after the capturing group:

\[\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}] ?([^:]+:.{1,12})
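
A minimal sketch of this pattern applied to the three sample lines from the question follows. The list and variable names are assumptions made for the example.

```python
import re

strict_pattern = re.compile(
    r"\[\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}] ?([^:]+:.{1,12})"
)

# The three sample lines from the question, without trailing newlines.
lines = [
    "[2019/07/11 18:52:25.391] Receive : <- AI (Req No. 711185105702666 ) Message from : cop10",
    "[2019/07/11 18:52:25.391] Note    : Response that is not being sent ... cop10",
    "[2019/07/11 18:52:25.393] ★Err    : subargs[0] : IBSDK_7776",
]

# One list of captured groups per input line.
results = [strict_pattern.findall(line) for line in lines]

for line_matches in results:
    print("matches is {}".format(line_matches))
```

The captured text keeps the padding spaces from the log, so calling strip() on each match may be useful if the whitespace is not wanted.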


Upvotes: 2
