Reputation: 141
I require to find out the phone bill due date from SMS using Python 3.4 I have used dateutil.parser and datefinder but with no success as per my use-case.
Example: sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID [email protected]. Pls check Inbox"
Code 1:
import datefinder
due_dates = datefinder.find_dates(sms_text)
for match in due_dates:
print(match)
Result: 2017-07-17 00:00:00
Code 2:
import dateutil.parser as dparser
due_date = dparser.parse(sms_text,fuzzy=True)
print(due_date)
Result: ValueError probably because of multiple dates in the text
How can I pick the due date from such texts? The date format is not fixed but there would be 2 dates in the text: one is the month for which bill is generated and other the is the due date, in the same order. Even if I get a regular expression to parse the text, it would be great.
More sample texts:
Upvotes: 0
Views: 11270
Reputation: 1625
There are two things that prevent datefinder
to parse correctly your samples:
datefinder
might prevent to find a suitable date format (in this case ':'
)The idea is to first sanitize the text by removing the parts of the text that prevent datefinder
to identify all the dates. Unfortunately, this is a bit of try and error as the regex used by this package is too big for me to analyze thoroughly.
def extract_duedate(text):
# Sanitize the text for datefinder by replacing the tricky parts
# with a non delimiter character
text = re.sub(':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
return list(datefinder.find_dates(text))[-1]
Rs[\d,\. ]+
will remove the bill amount so it is not mistaken as part of a date. It will match strings of the form 'Rs[.][ ][12,]345[.67]'
(actually more variations but this is just to illustrate).
Obviously, this is a raw example function. Here are the results I get:
1 : 2017-07-03 00:00:00
2 : 2017-06-06 00:00:00 # Wrong result: first date instead of today
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
There is one problem on the sample 2: 'today' is not recognized alone by datefinder
Example:
>>> list(datefinder.find_dates('Rs 219 is due today'))
[datetime.datetime(219, 7, 13, 0, 0)]
>>> list(datefinder.find_dates('is due today'))
[]
So, to handle this case, we could simply replace the token 'today'
by the current date as a first step. This would give the following function:
def extract_duedate(text):
if 'today' in text:
text = text.replace('today', datetime.date.today().isoformat())
# Sanitize the text for datefinder by replacing the tricky parts
# with a non delimiter character
text = re.sub(':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
return list(datefinder.find_dates(text))[-1]
Now the results are good for all samples:
1 : 2017-07-03 00:00:00
2 : 2017-07-18 00:00:00 # Well, this is the date of my test
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
If you need, you can let the function return all dates and they should all be correct.
Upvotes: 2
Reputation: 27
An idea for using dateutil.parser
:
from dateutil.parser import parse
for s in sms_text.split():
try:
print(parse(s))
except ValueError:
pass
Upvotes: 3
Reputation: 38
Having a text message as the example you have provided:
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID [email protected]. Pls check Inbox"
It could be possible to use pythons build in regex module to match on the 'due on' and 'has been' parts of the string.
import re
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID [email protected]. Pls check Inbox"
due_date = re.split('due on', re.split('has been', sms_text)[0])[1]
print(due_date)
Resulting: 15-07-2017
With this example the date format does not matter, but it is important that the words you are spliting the string on are consistent.
Upvotes: 0
Reputation: 1438
Why not just using regex
? If your input strings always contain this substrings due on ... has been
you can just do something like that:
import re
from datetime import datetime
string = """Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been
sent to your regd email ID [email protected]. Pls check Inbox"""
match_obj = re.search(r'due on (.*) has been', string)
if match_obj:
date_str = match_obj.group(1)
else:
print "No match!!"
try:
# DD-MM-YYYY
print datetime.strptime(date_str, "%d-%m-%Y")
except ValueError:
# try another format
try:
print datetime.strptime(date_str, "%Y-%m-%d")
except ValueError:
try:
print datetime.strptime(date_str, "%m-%d")
except ValueError:
...
Upvotes: 0