Karn Kumar

Reputation: 8816

Python regex to get the date from different combinations

I have a text file with multiple lines. One of the lines contains a description field, and that field holds dates in several notations, surrounded by other strings, for example colasas|04/18/2017|NXP, FTP Permanent|09|10|2012|FTP, and Project|16 July 2005|Design. I want to parse out only the dates. One option I found is the dateutil module, but it looks complicated and seems to need a lot of manipulation for this purpose.

Going through its test examples, I found that it works for certain combinations:

>>> from dateutil.parser import parse
>>> test_cases = ['04/30/2009', '06/20/95', '8/2/69', '1/25/2011', '9/3/2002', '4-13-82', 'Mar-02-2009', 'Jan 20, 1974',
...               'March 20, 1990', 'Dec. 21, 2001', 'May 25 2009', '01 Mar 2002', '2 April 2003', '20 Aug. 2004',
...               '20 November, 1993', 'Aug 10th, 1994', 'Sept 1st, 2005', 'Feb. 22nd, 1988', 'Sept 2002', 'Sep 2002',
...               'December, 1998', 'Oct. 2000', '6/2008', '12/2001', '1998', '2002']
>>> for date_string in test_cases:
...     print(date_string, parse(date_string).strftime("%Y%m%d"))
...
04/30/2009 20090430
06/20/95 19950620
8/2/69 19690802
----- etc --------

However, when I apply the same approach to the data below, it fails to return the dates.

The description field is optional and may be missing for some entries, so I considered using (?:description:* (.*))? .

description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP

I have reformatted the question to make it more visible and get more input.

Below is the actual script, which has three different matches: dn, ftpuser, and description. The script works for the first two, but the description field holds mixed, raw data from which I need only the dates. The dates are enclosed between pipes ("|").

#!/usr/bin/python3
# ./dataparse.py
from __future__ import print_function
import re
from signal import signal, SIGPIPE, SIG_DFL

# Exit quietly when the output is piped into a command that closes early (e.g. head).
signal(SIGPIPE, SIG_DFL)

# Read the whole file once; the pattern spans several lines.
# (Calling f.read() inside "for line in f" would silently skip the first line.)
with open('test2', 'r') as f:
    data = f.read()

regex = r"dn:(.*?)\nftpuser: (.*)\ndescription:* (.*)"
for match in re.findall(regex, data):
    # Join the three captured groups, then split on '=', ',' and whitespace.
    fields = re.sub(r'[=,]', ' ', ' '.join(match)).split()
    print("{0:<30}{1:<20}{2:<50}".format(fields[1], fields[8], fields[9]))

Resulting output:

$ ./dataparse.py
ab02                          disabled_5Mar07     Remedy
mela                          Y                   ROYALS|none|customer
ab01                          Y                   VGVzdGluZyA
[email protected]                   T                   REG-JP|7-31-05|REG-JP

Upvotes: 1

Views: 366

Answers (4)

kantal

Reputation: 2407

text="""
description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP
"""
import re

reg=re.compile(r"(?ms)\|(\d\d)(\d\d)(\d\d)\||\|(\d{1,2})[\|/\-](\d{1,2})[\|/\-](\d{2,4})\||\|(\d*)\s*(\w+)\s*(\d{4})\|")

dates= [ t[:3] if t[1] else t[3:6] if t[4] else t[6:] for t in reg.findall(text) ]
print(dates)

"""
    regexp for |121210| ---> \|(\d\d)(\d\d)(\d\d)\|
    for |16 July 2005| ---> \|(\d*)\s*(\w+)\s*(\d{4})\|
    for the others ---> \|(\d{1,2})[\|/\-](\d{1,2})[\|/\-](\d{2,4})\|
"""
Output:
[('04', '18', '2017'), ('04', '18', '2017'), ('09', '10', '2012'), ('', 'July', '2004'), ('12', '12', '10'), ('16', 'July', '2005'), ('10', '10', '2010'), ('7', '31', '05'), ('10', '11', '2010')]

To get each date string exactly as it appears:

reg=re.compile(r"(?ms)\|(\d{6})\||\|(\d{1,2}[\|/\-]\d{1,2}[\|/\-]\d{2,4})\||\|(\d*\s*\w+\s+\d{4})\|")

dates= [ t[0] or t[1] or t[2] for t in reg.findall(text) ]
print(dates)

Output:
['04/18/2017', '04/18/2017', '09|10|2012', 'July 2004', '121210', '16 July 2005', '10/10/2010', '7-31-05', '10|11|2010']
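If actual date objects are wanted rather than raw strings, one possible follow-up is to try a short list of strptime formats against each extracted string. This is only a sketch: the FORMATS list and the to_date helper are hypothetical names, and reading the numeric dates (including the six-digit one) as month-first is an assumption that the sample data does not settle. Month names such as "July" are matched by %B, which depends on the current locale.

```python
from datetime import datetime

# Candidate layouts, tried in order. Month-first ordering is an assumption.
FORMATS = ["%m/%d/%Y", "%m|%d|%Y", "%m-%d-%y", "%m%d%y", "%d %B %Y", "%B %Y"]

def to_date(raw):
    """Return a datetime.date for the first format that fits, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue  # this layout does not fit, try the next one
    return None

dates = ['04/18/2017', '09|10|2012', 'July 2004', '121210', '16 July 2005', '7-31-05']
for raw in dates:
    print(raw, '->', to_date(raw))
```

A string that fits none of the listed formats yields None, so unexpected notations can be spotted and a matching format added.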

Upvotes: 1

Karn Kumar

Reputation: 8816

I achieved it with a regex that captures the value between the pipes, as follows:

"(?:description:* .*\|([0-9]{1,2}[-/]+[0-9]{1,2}[-/]+[0-9]{2,4})\|.*)?"
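A minimal sketch of how this pattern behaves on a few of the sample lines from the question (the variable names are illustrative only). Because the whole expression is optional, match() always succeeds, and group(1) is None whenever no date was captured. Note that the character class [-/] covers only slash- and hyphen-separated numeric dates, so the pipe-separated and six-digit notations are not picked up by it.

```python
import re

pattern = re.compile(
    r"(?:description:* .*\|([0-9]{1,2}[-/]+[0-9]{1,2}[-/]+[0-9]{2,4})\|.*)?"
)

lines = [
    "description: colasas|04/18/2017|NXP",
    "description: WFS-JP|7-31-05|WFS-JP",
    "description: FTP Permanent|09|10|2012|FTP",
    "description: ftp|121210|ftp",
]

# The outer group is optional, so match() never returns None here;
# group(1) is None for lines where the date part did not match.
found = [pattern.match(line).group(1) for line in lines]
print(found)
```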

Upvotes: 0

Rakesh

Reputation: 82765

Using some string manipulation

Demo:

s = """description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP"""


from dateutil.parser import parse

for i in s.split("\n"):
    val = i.split("|", 1)                            #Split by first "|"
    if len(val) > 1:                                 #Check if Date in string.
        val = val[1].rpartition("|")[0]               #Split by right "|"
        print( parse(val, fuzzy=True) )

Output:

2017-04-18 00:00:00
2017-04-18 00:00:00
2012-07-03 00:00:00
2004-07-03 00:00:00
2010-12-12 00:00:00
2005-07-16 00:00:00
2010-10-10 00:00:00
2005-07-31 00:00:00
2010-07-03 00:00:00

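Regarding your datetime error: remove the line "from datetime import datetime" and use "import datetime" instead, so that datetime.datetime.strptime resolves correctly.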

Demo:

import re
import datetime
strh = "description: colasas|04/18/2017|NXP"
match = re.search(r'\d{2}/\d{2}/\d{4}', strh)
date = datetime.datetime.strptime(match.group(), '%m/%d/%Y').date()
print(date)
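Since many description lines carry no MM/DD/YYYY date at all, re.search can return None, and calling .group() on that raises AttributeError. A small guarded variant of the demo, as a sketch (slash_date is a hypothetical helper name):

```python
import re
import datetime

def slash_date(text):
    """Return a date for the first MM/DD/YYYY substring, or None if absent."""
    match = re.search(r'\d{2}/\d{2}/\d{4}', text)
    if match is None:
        return None  # no date in this notation on the line
    return datetime.datetime.strptime(match.group(), '%m/%d/%Y').date()

print(slash_date("description: colasas|04/18/2017|NXP"))
print(slash_date("description: Remedy Tkt 01212152 Orcad move"))
```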

Upvotes: 1

wim

Reputation: 362826

The parse function you're using accepts a keyword argument, fuzzy, that lets it ignore irrelevant parts of the string.

:param fuzzy:
    Whether to allow fuzzy parsing, allowing for string like "Today is
    January 1, 2047 at 8:21:00AM".

Demo:

>>> parse('colasas|04/18/2017|NXP', fuzzy=True)
datetime.datetime(2017, 4, 18, 0, 0)

There is another keyword, fuzzy_with_tokens, which also returns a tuple of the parts of the string that were ignored:

>>> parse('colasas|04/18/2017|NXP', fuzzy_with_tokens=True)
(datetime.datetime(2017, 4, 18, 0, 0), ('colasas|', '|NXP'))

This method won't work perfectly with all of your input strings, but it should get you most of the way there. You may have to do some pre-processing for the stranger ones.
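One possible shape for that pre-processing, as a sketch only: isolate the text between the first and last pipe, turn inner pipes into slashes, and split a bare six-digit run into three parts. The normalize helper is a hypothetical name, and reading the six digits as MMDDYY is an assumption (they could equally be DDMMYY or YYMMDD). The returned string could then be handed to parse.

```python
import re

def normalize(description):
    """Return the pipe-delimited field in a parser-friendlier form, or None."""
    first, last = description.find("|"), description.rfind("|")
    if first == -1 or first == last:
        return None                      # no pipe-delimited field on this line
    field = description[first + 1:last].strip()
    field = field.replace("|", "/")      # 09|10|2012 -> 09/10/2012
    # Assumption: a bare six-digit run is MMDDYY.
    field = re.sub(r"^(\d{2})(\d{2})(\d{2})$", r"\1/\2/\3", field)
    return field

for line in ["colasas|04/18/2017|NXP",
             "FTP Permanent|09|10|2012|FTP",
             "ftp|121210|ftp",
             "Remedy Tkt 01212152 Orcad move"]:
    print(repr(normalize(line)))
```

Fields that are not dates at all (for example "none") still come back unchanged, so the later parsing step needs its own error handling.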

Upvotes: 2
