Reputation: 4867
I am parsing a PDF banking statement file one line at a time. The problem is that the format is not always consistent.
Sometimes I have a string like this:
'Received from SOMEONE 11/02/2020 13∶20 $ 997,63 $ 997,63 -'
I am using Python 3 and I need to split the string on a time regex so that I get two strings. The expected result would be:
['Received from SOMEONE 11/02/2020 ', '13∶20 $ 997,63 $ 997,63 -']
Among many others, I have tested the following regexes:
r"\s+(?=\d+\d+:\d+\d\s)"
r"(?:(?:(\d+):)?(\d+))"
r"(2[0-3]|[01]?[0-9]):([0-5]?[0-9])"
r"(?:([01]?\d|2[0-3]):([0-5]?\d))"
Could anyone please help me with the right regex to achieve what I need?
Thanks a lot.
Upvotes: 1
Reputation: 3294
One way to get the two parts is to match on the date rather than split on the time:
import re
s = 'Received from SOMEONE 11/02/2020 13∶20 $ 997,63 $ 997,63 -'
print(re.findall(r'(.+?\d\d/\d\d/\d{4})\s*(.+)', s)[0])
The regex string can be explained as follows:
anything, matched lazily: .+?
followed by a date: \d\d/\d\d/\d{4}
followed by optional whitespace: \s*
followed by anything: .+
Output:
('Received from SOMEONE 11/02/2020', '13∶20 $ 997,63 $ 997,63 -')
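Since the statement format is not always consistent, `findall(...)[0]` raises an IndexError on a line that has no date. A minimal sketch of a more defensive variant, assuming you would rather get None for such lines (the names `DATE_THEN_REST` and `split_after_date` are made up for illustration):

```python
import re

# Raw string so that \d and \s reach the regex engine unchanged.
DATE_THEN_REST = re.compile(r'(.+?\d\d/\d\d/\d{4})\s*(.+)')

def split_after_date(line):
    """Return (text up to and including the date, remainder), or None if no date is found."""
    match = DATE_THEN_REST.match(line)
    return match.groups() if match else None

s = 'Received from SOMEONE 11/02/2020 13∶20 $ 997,63 $ 997,63 -'
print(split_after_date(s))                  # ('Received from SOMEONE 11/02/2020', '13∶20 $ 997,63 $ 997,63 -')
print(split_after_date('Opening balance'))  # None
```

Note that this returns a tuple and drops the space after the date, so it is not byte-for-byte the list asked for in the question.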
Upvotes: 0
Reputation: 163352
You could split using a lookahead that asserts a time-like pattern:
(?=\b(?:[01]\d|2[0-3]):[0-5]\d\b)
Note that when I copy the example data, the ∶ is the character U+2236 (https://www.compart.com/en/unicode/U+2236), whereas in the regex I have used : which is the character U+003A (https://www.compart.com/en/unicode/U+003A).
If you want to match both, you could use a character class [:∶]
(?=\b(?:[01]\d|2[0-3])[:∶][0-5]\d\b)
import re
# \b is placed outside the group so that it applies to both hour alternatives
regex = r"(?=\b(?:[01]\d|2[0-3]):[0-5]\d\b)"
test_str = "Received from SOMEONE 11/02/2020 13:20 $ 997,63 $ 997,63 -"
# Splitting on a zero-width match requires Python 3.7 or later
print(re.split(regex, test_str))
Output:
['Received from SOMEONE 11/02/2020 ', '13:20 $ 997,63 $ 997,63 -']
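To tie this back to the original data, here is a rough sketch that accepts either colon character and splits at most once, so a second time later on the same line does not produce a third piece. The helper name `split_before_time` is hypothetical, the word boundary is placed in front of the whole hour group, and Python 3.7+ is assumed because the split happens on a zero-width match:

```python
import re

# Zero-width lookahead: the split point is just before HH:MM,
# where the separator may be ':' (U+003A) or '∶' (U+2236).
TIME_AHEAD = re.compile(r"(?=\b(?:[01]\d|2[0-3])[:\u2236][0-5]\d\b)")

def split_before_time(line):
    """Split once in front of the first time; a line without a time comes back as a one-item list."""
    return TIME_AHEAD.split(line, maxsplit=1)

line = 'Received from SOMEONE 11/02/2020 13∶20 $ 997,63 $ 997,63 -'
print(split_before_time(line))
# ['Received from SOMEONE 11/02/2020 ', '13∶20 $ 997,63 $ 997,63 -']
print(split_before_time('Opening balance'))
# ['Opening balance']
```

Checking the length of the returned list is then enough to tell whether a given statement line contained a time at all.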
Upvotes: 1