Albert Vonpupp

Reputation: 4867

Match hour:minute time format with regex in Python

I am parsing a PDF banking statement file one line at a time. The problem is that the format is not always consistent.

Sometimes I have a string like this:

'Received from SOMEONE           11/02/2020   13∶20     $ 997,63   $ 997,63            -'

I am using Python 3 and need to split the string at the time, using a regex, so that I end up with two strings. The expected result would be:

['Received from SOMEONE           11/02/2020   ', '13∶20     $ 997,63   $ 997,63            -']

Among many others, I have tested the following regexes:

Could anyone please help me with the right regex to achieve what I need?

Thanks a lot.

Upvotes: 1

Views: 1308

Answers (2)

paradocslover

Reputation: 3294

Here is exactly what you need:

import re
s = 'Received from SOMEONE           11/02/2020   13∶20     $ 997,63   $ 997,63            -'
# Raw string so that \d and \s reach the regex engine unchanged
print(re.findall(r'(.+?\d\d/\d\d/\d{4})\s*(.+)', s)[0])

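The regex can be read as follows: anything, matched non-greedily (.+?), followed by a date (\d\d/\d\d/\d{4}), followed by optional whitespace (\s*), followed by anything (.+). Note that it splits after the date rather than at the time, so the whitespace between the two is dropped.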

Output:

('Received from SOMEONE           11/02/2020', '13∶20     $ 997,63   $ 997,63            -')
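Since the question mentions that the statement lines are not always consistent, it may be worth noting that indexing the findall result with [0] raises an IndexError for a line without a date. A minimal sketch with a guard could look like the following; the helper name split_after_date is hypothetical, and returning None for non-matching lines is an assumption about how such lines should be handled.

```python
import re

# Same pattern as above: lazy prefix up to a dd/mm/yyyy date, then the rest.
DATE_SPLIT = re.compile(r"(.+?\d\d/\d\d/\d{4})\s*(.+)")

def split_after_date(line):
    """Return (before, after) split after the first date, or None if no date is found.

    Hypothetical helper, not part of any library.
    """
    m = DATE_SPLIT.match(line)
    return m.groups() if m else None

s = 'Received from SOMEONE           11/02/2020   13∶20     $ 997,63   $ 997,63            -'
print(split_after_date(s))
print(split_after_date('a line without any date'))
```

The second call prints None instead of raising, so the caller can decide what to do with lines that do not fit the expected layout.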

Upvotes: 0

The fourth bird

Reputation: 163352

You could split using a lookahead that asserts a time-like pattern:

(?=(?:\b[01]\d|2[0-3]):[0-5]\d\b)


Note that when I copy the example data, the separator in the time is this character: https://www.compart.com/en/unicode/U+2236 (∶, RATIO), whereas in the regex I have used an ordinary colon: https://www.compart.com/en/unicode/U+003A (:)

If you want to match both, you could use a character class [:∶]

(?=(?:\b[01]\d|2[0-3])[:∶][0-5]\d\b)


import re

regex = r"(?=(?:\b[01]\d|2[0-3]):[0-5]\d\b)"

test_str = "Received from SOMEONE           11/02/2020   13:20     $ 997,63   $ 997,63            -"
print(re.split(regex, test_str))

Output

['Received from SOMEONE           11/02/2020   ', '13:20     $ 997,63   $ 997,63            -']
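The string in the question contains U+2236 rather than an ordinary colon, so the character-class variant is probably the one that applies to the original data. A minimal sketch, assuming Python 3.7 or later (where re.split accepts empty matches): the helper name split_at_time is hypothetical, and maxsplit=1 reflects an assumption that only the first time on a line should act as the split point.

```python
import re

# Lookahead for HH:MM where the separator is ":" (U+003A) or "∶" (U+2236).
TIME_LOOKAHEAD = re.compile(r"(?=(?:\b[01]\d|2[0-3])[:\u2236][0-5]\d\b)")

def split_at_time(line):
    """Split a line just before the first HH:MM time (hypothetical helper)."""
    return TIME_LOOKAHEAD.split(line, maxsplit=1)

line = 'Received from SOMEONE           11/02/2020   13∶20     $ 997,63   $ 997,63            -'
parts = split_at_time(line)
print(parts)
```

Because the lookahead consumes no characters, joining the parts gives back the original line, and a line without a time comes back as a single-element list.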

Upvotes: 1
