Match hour:minute time format with regex in Python

Question

I am parsing a PDF banking statement file one line at a time. The problem is that the format is not always consistent.

Sometimes I have a string like this:

'Received from SOMEONE           11/02/2020   13∶20     $ 997,63   $ 997,63            -'

I am using Python 3 and need to split the string at the time with a regex so that I get two strings. The expected result would be:

['Received from SOMEONE           11/02/2020   ', '13∶20     $ 997,63   $ 997,63            -']

Among many others, I have tested the following regexes:

r"\s+(?=\d+\d+:\d+\d\s)"
r"(?:(?:(\d+):)?(\d+))"
r"(2[0-3]|[01]?[0-9]):([0-5]?[0-9])"
r"(?:([01]?\d|2[0-3]):([0-5]?\d))"

Could anyone please help me with the right regex to achieve what I need?

Thanks a lot.

The fourth bird · Accepted Answer

You could split using a lookahead that asserts a time-like pattern. A lookahead is zero-width, so the split happens just before the time and the time itself is kept:

(?=(?:\b[01]\d|2[0-3]):[0-5]\d\b)

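To illustrate why the lookahead matters here, this is a small sketch comparing it with the same pattern used without a lookahead. The variable names are only for illustration, and the sample line uses the plain ASCII colon.

```python
import re

line = "Received from SOMEONE           11/02/2020   13:20     $ 997,63   $ 997,63            -"

# Consuming pattern: the matched time is the separator, so it is dropped.
consuming = re.split(r"(?:\b[01]\d|2[0-3]):[0-5]\d\b", line)

# Lookahead: the match is empty, so nothing is removed and the time
# stays at the start of the second part.
# Splitting on an empty match needs Python 3.7 or later.
lookahead = re.split(r"(?=(?:\b[01]\d|2[0-3]):[0-5]\d\b)", line)

print(consuming[1].startswith("13:20"))  # False, the time was consumed
print(lookahead[1].startswith("13:20"))  # True, the time was kept
```

With the lookahead, joining the two parts gives back the original line unchanged.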

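Note that when I copy the example data, the ∶ in the time is the RATIO character U+2236 (https://www.compart.com/en/unicode/U+2236), whereas in the regex I have used the ordinary colon : U+003A (https://www.compart.com/en/unicode/U+003A). The pattern above therefore does not match the pasted text as-is.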
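If you are not sure which character your PDF extraction produces, a quick check along these lines may help. It is only a sketch: the shortened `pasted` string is an assumed sample containing U+2236, written as an escape so it cannot be confused with a colon.

```python
import re
import unicodedata

# Assumed sample: part of the statement line, with U+2236 between hour and minute
pasted = "11/02/2020   13\u223620     $ 997,63"

# Show the code point and the Unicode name of both look-alike characters
for ch in (":", "\u2236"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The pattern with the ASCII colon finds nothing to split on,
# so re.split returns the whole string as a single element.
ascii_only = r"(?=(?:\b[01]\d|2[0-3]):[0-5]\d\b)"
print(len(re.split(ascii_only, pasted)))  # 1
```

The loop reports COLON for U+003A and RATIO for U+2236, which is why a pattern that looks correct can still fail on the extracted text.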

If you want to match both, you could use a character class [:∶]

(?=(?:\b[01]\d|2[0-3])[:∶][0-5]\d\b)


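import re

# Zero-width lookahead: split just before an HH:MM time without consuming it
regex = r"(?=(?:\b[01]\d|2[0-3]):[0-5]\d\b)"

# Note: this test string uses the ASCII colon (U+003A)
test_str = "Received from SOMEONE           11/02/2020   13:20     $ 997,63   $ 997,63            -"
# Splitting on an empty match requires Python 3.7+
print(re.split(regex, test_str))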

Output

['Received from SOMEONE           11/02/2020   ', '13:20     $ 997,63   $ 997,63            -']
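As a possible variation, here is a sketch that accepts both colon characters and wraps the split in a helper. The names `TIME_AHEAD` and `split_before_time` are made up for this example. It also moves the \b in front of the group, which is my own adjustment: in the pattern above the leading \b belongs only to the first alternative, so an input such as 120:30 would be split before 20:30.

```python
import re

# \b sits outside the group so it applies to both hour alternatives.
# [:\u2236] accepts the ASCII colon and the RATIO character.
TIME_AHEAD = re.compile(r"(?=\b(?:[01]\d|2[0-3])[:\u2236][0-5]\d\b)")


def split_before_time(line):
    """Split a line in two just before the first HH:MM time (hypothetical helper).

    Returns a one-element list when the line contains no time.
    """
    return TIME_AHEAD.split(line, maxsplit=1)


line = "Received from SOMEONE           11/02/2020   13\u223620     $ 997,63   $ 997,63            -"
parts = split_before_time(line)

print(len(parts))                # 2
print(parts[0].rstrip())         # Received from SOMEONE           11/02/2020
print(TIME_AHEAD.split("ref 120:30"))  # ['ref 120:30'], no split inside 120
```

Since the statement format is not always consistent, it is probably worth checking the length of the returned list before unpacking it into two variables.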
