Drunk Knight

Reputation: 141

How to extract date from string using Python 3.x

I need to extract the phone bill due date from an SMS using Python 3.4. I have tried dateutil.parser and datefinder, but neither works for my use case.

Example: sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID [email protected]. Pls check Inbox"

Code 1:

import datefinder
due_dates = datefinder.find_dates(sms_text)
for match in due_dates:
    print(match)

Result: 2017-07-17 00:00:00

Code 2:

import dateutil.parser as dparser
due_date = dparser.parse(sms_text,fuzzy=True)
print(due_date)

Result: ValueError probably because of multiple dates in the text

How can I pick the due date from such texts? The date format is not fixed, but there would be two dates in the text: the first is the month for which the bill was generated and the second is the due date, in that order. Even a regular expression to parse the text would be great.

More sample texts:

  1. Hello! Your phone billed outstanding is 293.72 due date is 03rd Jul.
  2. Bill dated 06-JUN-17 for Rs 219 is due today for your phone No. 1234567890
  3. Bill dated 06-JUN-17 for Rs 219 is due on Jul 5 for your phone No. 1234567890
  4. Bill dated 27-Jun-17 for your operator fixedline/broadband ID 1234567890 has been sent at [email protected] from [email protected]. Due amount: Rs 3,764.53, due date: 16-Jul-17.
  5. Details of bill dated 21-JUN-2017 for phone no. 1234567890: Total Due: Rs 374.12, Due Date: 09-JUL-2017, Bill Delivery Date: 25-Jun-2017,
  6. Greetings! Bill for your mobile 1234567890, dtd 18-Jun-17, payment due date 06-Jul-17 has been sent on [email protected]
  7. Dear customer, your phone bill of Rs.191.24 was due on 25-Jun-2017
  8. Hi! Your phone bill for Rs. 560.41 is due on 03-07-2017.

Upvotes: 0

Views: 11270

Answers (4)

Gall

Reputation: 1625

There are two things that prevent datefinder from parsing your samples correctly:

  1. the bill amount: numbers are interpreted as years, so an amount with 3 or 4 digits creates a date
  2. characters that datefinder treats as delimiters can prevent it from finding a suitable date format (in this case ':')

The idea is to first sanitize the text by removing the parts that prevent datefinder from identifying all the dates. Unfortunately, this involves some trial and error, as the regex used by this package is too big for me to analyze thoroughly.

import re
import datefinder

def extract_duedate(text):
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

    # The last date found in the text is taken to be the due date
    return list(datefinder.find_dates(text))[-1]

Rs[\d,\. ]+ removes the bill amount so that it is not mistaken for part of a date. It matches strings of the form 'Rs[.][ ][12,]345[.67]' (and more variations; this is just to illustrate).
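To see what the substitution does on its own, here is a small check that needs only the standard re module. It applies the same pattern to a fragment of sample 4 from the question; the sanitize helper is just an illustrative name, not part of datefinder.

```python
import re

def sanitize(text):
    # Replace ':' and any 'Rs<amount>' run with '|', as in extract_duedate
    return re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

sample = "Due amount: Rs 3,764.53, due date: 16-Jul-17."
print(sanitize(sample))
# Due amount| |due date| 16-Jul-17.
```

Note that the character class also swallows the comma and space after the amount, which is why 'due date' directly follows the second '|'.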

Obviously, this is a rough example function. Here are the results I get:

1 : 2017-07-03 00:00:00
2 : 2017-06-06 00:00:00 # Wrong result: first date instead of today
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00

There is one problem with sample 2: datefinder does not recognize 'today' on its own.

Example:

>>> list(datefinder.find_dates('Rs 219 is due today'))
[datetime.datetime(219, 7, 13, 0, 0)]
>>> list(datefinder.find_dates('is due today'))
[]

So, to handle this case, we could simply replace the token 'today' with the current date as a first step. This would give the following function:

import datetime
import re
import datefinder

def extract_duedate(text):
    # Replace 'today' with the current date so datefinder can pick it up
    if 'today' in text:
        text = text.replace('today', datetime.date.today().isoformat())

    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

    return list(datefinder.find_dates(text))[-1]

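Now every sample yields a date. Note that sample 5 returns the bill delivery date (the last date in the text) rather than the due date of 09-JUL-2017, and sample 8 appears to be read month-first (7 March rather than 3 July):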

1 : 2017-07-03 00:00:00
2 : 2017-07-18 00:00:00 # Well, this is the date of my test
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00

If you need to, you can have the function return all the dates instead of only the last one; they should all be correct.

Upvotes: 2

ISV

Reputation: 27

An idea for using dateutil.parser:

from dateutil.parser import parse

for s in sms_text.split():
    try:
        print(parse(s))
    except ValueError:
        pass

Upvotes: 3

Daan ter horst

Reputation: 38

Given a text message like the example you provided:

sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID [email protected]. Pls check Inbox"

It is possible to use Python's built-in re module to split on the 'due on' and 'has been' parts of the string.

import re

sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID [email protected]. Pls check Inbox"

# Keep the text between 'due on' and 'has been', without the surrounding spaces
due_date = re.split('due on', re.split('has been', sms_text)[0])[1].strip()

print(due_date)

Result: 15-07-2017

With this approach the date format does not matter, but it is important that the words you are splitting the string on are consistent.
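The other sample texts in the question do not all use 'due on ... has been', so the split above would not work for them. As a rough sketch only, a single regex can accept the keyword variants that appear in those samples ('due on', 'due date', 'due date is', 'due date:'). The pattern and the find_due_date_text name are illustrative assumptions derived from the eight samples, not a general solution; sample 2 ('due today') is deliberately not covered.

```python
import re

# Keyword variants and date shapes are taken from the question's samples only
DUE_PATTERN = re.compile(
    r"due\s+(?:on|date(?:\s+is)?)\s*:?\s*"  # 'due on', 'due date is', 'due date:'
    r"(\d{1,2}(?:st|nd|rd|th)?[- ][A-Za-z0-9]+(?:-\d{2,4})?"  # 15-07-2017, 16-Jul-17, 03rd Jul
    r"|[A-Za-z]{3,9}\s+\d{1,2})",  # Jul 5
    re.IGNORECASE,
)

def find_due_date_text(sms_text):
    """Return the raw date substring that follows a 'due' keyword, or None."""
    match = DUE_PATTERN.search(sms_text)
    return match.group(1) if match else None

print(find_due_date_text("Total Due: Rs 374.12, Due Date: 09-JUL-2017, Bill Delivery Date: 25-Jun-2017,"))
# 09-JUL-2017
```

The function returns the date as text; converting it to a datetime is a separate step.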

Upvotes: 0

Alexey

Reputation: 1438

Why not just use a regex? If your input strings always contain the substrings due on ... has been, you can do something like this:

import re
from datetime import datetime

string = """Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been
 sent to your regd email ID [email protected]. Pls check Inbox"""

match_obj = re.search(r'due on (.*) has been', string)

if match_obj:
    date_str = match_obj.group(1)
    # Try each known format in turn: DD-MM-YYYY, YYYY-MM-DD, MM-DD
    # (extend the tuple with further formats as needed)
    for fmt in ("%d-%m-%Y", "%Y-%m-%d", "%m-%d"):
        try:
            print(datetime.strptime(date_str, fmt))
            break
        except ValueError:
            continue  # try another format
else:
    print("No match!!")
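Several of the other sample texts in the question write the month as an abbreviation (16-Jul-17, 09-JUL-2017), so the list of formats would need %b entries as well. Here is a minimal sketch of that idea as a reusable helper; parse_date and FORMATS are illustrative names, the format list is an assumption based on the samples, and %b matching assumes an English locale.

```python
from datetime import datetime

# Formats seen in the question's samples; extend as needed
FORMATS = ("%d-%m-%Y", "%d-%b-%Y", "%d-%b-%y")

def parse_date(date_str, formats=FORMATS):
    """Return the first successful strptime result, or None if nothing fits."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue  # try the next format
    return None

print(parse_date("15-07-2017"))   # 2017-07-15 00:00:00
print(parse_date("16-Jul-17"))    # 2017-07-16 00:00:00
print(parse_date("09-JUL-2017"))  # 2017-07-09 00:00:00
```

Returning None instead of raising leaves it to the caller to decide what to do with a string that matches none of the formats.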

Upvotes: 0
