Reputation: 6920
I'm trying to extract times from a single string that contains other text besides the time. An example is s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'.
I've tried using the datefinder module like this:
from datetime import datetime as dt
import datefinder as dfn
for m in dfn.find_dates(s):
    print(m.strftime("%H:%M:%S"))
Which gives me this :
17:58:00
In this case the time "06:00" is missed. Now if I try without datefinder, using only the datetime module, like this:
dt.strftime(s, "%H:%M")
It tells me, with the following error, that the input must already be a datetime object, not a string:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'str'
So I tried to use the dateutil module to parse the string s into a datetime object:
from dateutil.parser import parse
parse(s)
But now it says that my string is not in a proper format (and in most cases it will not be in any fixed format), showing me this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1358, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '12/Jul/2019 12/Aug/2019 MEISHAN BRIDGE 06:00 17:58')
I have thought of getting the times with a regex like this:
import re
p = r"\d{2}\:\d{2}"
times = [i.group() for i in re.finditer(p, s)]
# Gives me ['06:00', '17:58']
But doing it this way means I have to check again whether the chunks matched by the regex are actually times, because even "99:99" would match and be wrongly reported as a time. Is there any workaround without regex to get all the times from a single string?
Please note that the string may or may not contain a date, but it will always contain a time. Even if it contains a date, the date format could be anything, and the string may or may not contain other irrelevant text.
Upvotes: 0
Views: 1007
Reputation: 79208
You could use a dictionary:
my_dict = {}
for i in s.split(', '):
    m = i.strip().split(' : ', 1)
    my_dict[m[0]] = m[1].split()
my_dict
Out:
{'Dates': ['12/Jul/2019', '12/Aug/2019'],
'Loc': ['MEISHAN', 'BRIDGE'],
'Time': ['06:00', '17:58']}
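As a possible follow-up, the entries under 'Time' could be validated with datetime.strptime so that something like "99:99" is rejected. This is only a sketch: it assumes the 'key : value, key : value' layout of the sample string, and the helper name valid_times is made up for illustration.

```python
from datetime import datetime

s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'

# Build the dictionary the same way as above.
my_dict = {}
for part in s.split(', '):
    key, value = part.strip().split(' : ', 1)
    my_dict[key] = value.split()


def valid_times(candidates, fmt="%H:%M"):
    """Hypothetical helper: keep only the strings that parse with fmt."""
    result = []
    for c in candidates:
        try:
            result.append(datetime.strptime(c, fmt).time())
        except ValueError:
            pass  # not a real time, e.g. '99:99'
    return result


times = valid_times(my_dict.get('Time', []))
print(times)  # [datetime.time(6, 0), datetime.time(17, 58)]
```

This only helps when the string really follows that layout; free-form input would still need a pattern-based search.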
Upvotes: 0
Reputation: 5682
I don't see many options here, so I would go with a heuristic. I would run the following against the whole dataset and extend the config/regexes until it covers all/most of the cases:
import re
import logging
from datetime import datetime as dt
s = 'Dates : 12/Jul/2019 12/08/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58:59'
SUPPORTED_DATE_FMTS = {
    re.compile(r"(\d{2}/\w{3}/\d{4})"): "%d/%b/%Y",
    re.compile(r"(\d{2}/\d{2}/\d{4})"): "%d/%m/%Y",
    re.compile(r"(\d{2}/\w{3}\w+/\d{4})"): "%d/%B/%Y",
    # Capture more here
}
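SUPPORTED_TIME_FMTS = {
    # The lookarounds keep HH:MM from matching inside HH:MM:SS and still allow a match at the end of the string
    re.compile(r"(?<![\d:])((?:[0-1][0-9]|2[0-3]):[0-5][0-9])(?![\d:])"): "%H:%M",
    re.compile(r"(?<![\d:])((?:[0-1][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9])(?!\d)"): "%H:%M:%S",
    # Capture more here
}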
def extract_supported_dt(config, s):
    """
    Loop through the given config (keys are regexes, values are date/time formats)
    and attempt to gather all valid data.
    """
    valid_data = []
    for regex, fmt in config.items():
        # Extract what looks like a date/time
        valid_ish_data = regex.findall(s)
        if not valid_ish_data:
            continue
        print("Checking " + str(valid_ish_data))
        # Validate it
        for d in valid_ish_data:
            try:
                valid_data.append(dt.strptime(d, fmt))
            except ValueError:
                pass
    return valid_data
# Handle dates
dates = extract_supported_dt(SUPPORTED_DATE_FMTS, s)
# Handle times
times = extract_supported_dt(SUPPORTED_TIME_FMTS, s)
print("Found dates: ")
for date in dates:
    print("\t" + str(date.date()))
print("Found times: ")
for t in times:
    print("\t" + str(t.time()))
Example output:
Checking ['12/Jul/2019']
Checking ['12/08/2019']
Checking ['06:00']
Checking ['17:58:59']
Found dates:
2019-07-12
2019-08-12
Found times:
06:00:00
17:58:59
This is a trial-and-error approach, but I do not think there is an alternative in your case. My goal here is to make it as easy as possible to extend support for more date/time formats, rather than to find a solution that covers 100% of the data on day one. This way, the more data you run it against, the more complete your config becomes.
One thing to note is that you will have to detect strings that appear to have no dates and log them somewhere. Later you will need to review them manually and see whether something that was missed could be captured.
Now, assuming that your data is generated by another system, sooner or later you will be able to match 100% of it. If the data is entered by humans, you will probably never reach 100% (people tend to make spelling mistakes and sometimes enter random stuff... date=today :) ).
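The point about logging strings that yield no match could look roughly like this. It is a sketch with assumptions: it handles only a single HH:MM pattern for brevity, and the names extract_times, process and the logger name "dt_extract" are invented for illustration.

```python
import logging
import re
from datetime import datetime

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("dt_extract")  # hypothetical logger name

# Assumed single pattern for brevity; a real config would hold more.
TIME_RE = re.compile(r"(?<![\d:])(\d{2}:\d{2})(?![\d:])")


def extract_times(s):
    """Return the validated HH:MM times found in s."""
    found = []
    for candidate in TIME_RE.findall(s):
        try:
            found.append(datetime.strptime(candidate, "%H:%M").time())
        except ValueError:
            pass  # looked like a time but is not one, e.g. '99:99'
    return found


def process(lines):
    """Split lines into parsed results and unmatched lines, logging the latter."""
    parsed, unmatched = {}, []
    for line in lines:
        times = extract_times(line)
        if times:
            parsed[line] = times
        else:
            log.warning("No time found in: %r", line)
            unmatched.append(line)
    return parsed, unmatched


parsed, unmatched = process([
    "Time : 06:00 17:58",
    "Time : 99:99",
    "date=today",
])
print(unmatched)  # ['Time : 99:99', 'date=today']
```

The unmatched list is what you would review by hand before adding a new regex/format pair to the config.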
Upvotes: 1
Reputation: 2803
Use a regex, but something like this:
(?=[0-1])[0-1][0-9]\:[0-5][0-9]|(?=2)[2][0-3]\:[0-5][0-9]
This matches:
00:00, 00:59, 01:00, 01:59, 02:00, 02:59, 09:00, 10:00, 11:59, 20:00, 21:59, 23:59
It does not match:
99:99, 23:99, 01:99
Check whether it works for you.
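For reference, this is roughly how the pattern could be applied in Python with re.findall, using the sample string from the question. The variable name TIME_RE is just a label chosen here.

```python
import re

# The pattern from this answer, unchanged.
TIME_RE = re.compile(r"(?=[0-1])[0-1][0-9]\:[0-5][0-9]|(?=2)[2][0-3]\:[0-5][0-9]")

s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'

matches = TIME_RE.findall(s)
print(matches)  # ['06:00', '17:58']

rejected = TIME_RE.findall("99:99 23:99 01:99")
print(rejected)  # []
```

Note that the pattern has no boundary checks, so in a string such as "17:58:59" it would still pick out "17:58".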
Upvotes: 0
Reputation: 2742
How to extract multiple time from same string in Python?
If you only need the time, this regex should work fine (note that it also accepts hours 24 to 29, so validate the matches if that matters):
r"[0-2][0-9]\:[0-5][0-9]"
If there could be spaces inside the time, like 23 : 59, use this:
r"[0-2][0-9]\s*\:\s*[0-5][0-9]"
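A small sketch of how the whitespace-tolerant pattern might be combined with a validation step. The capture groups and the helper name find_times are additions made here for illustration, not part of the pattern above.

```python
import re
from datetime import datetime

# Second pattern from this answer, with groups added to pull out the parts.
TIME_RE = re.compile(r"([0-2][0-9])\s*\:\s*([0-5][0-9])")


def find_times(text):
    """Hypothetical helper: return validated times as 'HH:MM' strings."""
    times = []
    for hours, minutes in TIME_RE.findall(text):
        candidate = f"{hours}:{minutes}"  # drop any spaces around the colon
        try:
            datetime.strptime(candidate, "%H:%M")
        except ValueError:
            continue  # e.g. '29:59' matches the regex but is not a time
        times.append(candidate)
    return times


result = find_times("Time : 06:00 23 : 59 and 29:59")
print(result)  # ['06:00', '23:59']
```

Validating with strptime after the match keeps the regex simple while still rejecting hours above 23.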
Upvotes: 0