Reputation: 8816
I have a text file having multiple lines one of the line contains field description
, and that field has multiple combination or notation of dates surrounded by other strings like colasas|04/18/2017|NXP
, FTP Permanent|09|10|2012|FTP
, and Project|16 July 2005|Design
. from which I want to parse the dates only, One way I found is to use dateutil
module which looks to be complicated and lot of manipulation for this purpose.
So, while going through the examples test , it works for certain combinations..
>>> from dateutil.parser import parse
>>> test_cases = ['04/30/2009', '06/20/95', '8/2/69', '1/25/2011', '9/3/2002', '4-13-82', 'Mar-02-2009', 'Jan 20, 1974',
... 'March 20, 1990', 'Dec. 21, 2001', 'May 25 2009', '01 Mar 2002', '2 April 2003', '20 Aug. 2004',
... '20 November, 1993', 'Aug 10th, 1994', 'Sept 1st, 2005', 'Feb. 22nd, 1988', 'Sept 2002', 'Sep 2002',
... 'December, 1998', 'Oct. 2000', '6/2008', '12/2001', '1998', '2002']
>>> for date_string in test_cases:
... print(date_string, parse(date_string).strftime("%Y%m%d"))
...
04/30/2009 20090430
06/20/95 19950620
8/2/69 19690802
----- etc --------
However, I have the below data combination which I need to parse but while opting for above solution it fails to get the results..
As description
is optional as it may be missing at certain point so , I considered using (?:description:* (.*))?
.
description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP
I have re-formated the Question Just allow to more visibility to get more inputs.
Below is the actula script which is having three diffrent matches dn
, ftpuser
and the last description
which i'm looking for the solution.
Below script is working for all the matches but the last feild which is description having the mixed and raw data from which i need the dates only
"|"
.#!/usr/bin/python3
# ./dataparse.py
from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import re
with open('test2', 'r') as f:
for line in f:
line = line.strip()
data = f.read()
regex = (r"dn:(.*?)\nftpuser: (.*)\ndescription:* (.*)")
matchObj = re.findall(regex, data)
for index in matchObj:
#print(index)
index_str = ' '.join(index)
new_str = re.sub(r'[=,]', ' ', index_str)
new_str = new_str.split()
print("{0:<30}{1:<20}{2:<50}".format(new_str[1],new_str[8],new_str[9]))
Resulted output:
$ ./dataparse.py
ab02 disabled_5Mar07 Remedy
mela Y ROYALS|none|customer
ab01 Y VGVzdGluZyA
[email protected] T REG-JP|7-31-05|REG-JP
Upvotes: 1
Views: 366
Reputation: 2407
text="""
description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP
"""
import re
reg=re.compile(r"(?ms)\|(\d\d)(\d\d)(\d\d)\||\|(\d{1,2})[\|/\-](\d{1,2})[\|/\-](\d{2,4})\||\|(\d*)\s*(\w+)\s*(\d{4})\|")
dates= [ t[:3] if t[1] else t[3:6] if t[4] else t[6:] for t in reg.findall(text) ]
print(dates)
"""
regexp for |121210| ---> \|(\d\d)(\d\d)(\d\d)\|
for |16 July 2005| ---> \|(\d*)\s*(\w+)\s*(\d{4})\|
for the others ---> \|(\d{1,2})[\|/\-](\d{1,2})[\|/\-](\d{2,4})\|
"""
Output: [('04', '18', '2017'), ('04', '18', '2017'), ('09', '10', '2012'), ('', 'July', '2004'), ('12', '12', '10'), ('16', 'July', '2005'), ('10', '10', '2010'), ('7', '31', '05'), ('10', '11', '2010')]
Get the date as it is:
reg=re.compile(r"(?ms)\|(\d{6})\||\|(\d{1,2}[\|/\-]\d{1,2}[\|/\-]\d{2,4})\||\|(\d*\s*\w+\s+\d{4})\|")
dates= [ t[0] or t[1] or t[2] for t in reg.findall(text) ]
print(dates)
Output:
['04/18/2017', '04/18/2017', '09|10|2012', 'July 2004', '121210', '16 July 2005', '10/10/2010', '7-31-05', '10|11|2010']
Upvotes: 1
Reputation: 8816
I achieved it through regex
considering the values between pipes as follows:
"(?:description:* .*\|([0-9]{1,2}[-/]+[0-9]{1,2}[-/]+[0-9]{2,4})\|.*)?"
Upvotes: 0
Reputation: 82765
Using some string manipulation
Demo:
s = """description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP"""
from dateutil.parser import parse
for i in s.split("\n"):
val = i.split("|", 1) #Split by first "|"
if len(val) > 1: #Check if Date in string.
val = val[1].rpartition("|")[0] #Split by right "|"
print( parse(val, fuzzy=True) )
Output:
2017-04-18 00:00:00
2017-04-18 00:00:00
2012-07-03 00:00:00
2004-07-03 00:00:00
2010-12-12 00:00:00
2005-07-16 00:00:00
2010-10-10 00:00:00
2005-07-31 00:00:00
2010-07-03 00:00:00
Regarding your datetime error remove from datetime import datetime
Demo:
import re
import datetime
strh = "description: colasas|04/18/2017|NXP"
match = re.search(r'\d{2}/\d{2}/\d{4}', strh)
date = datetime.datetime.strptime(match.group(), '%m/%d/%Y').date()
print(date)
Upvotes: 1
Reputation: 362826
The parse
method you're using accepts a keyword argument to allow ignoring irrelevant parts of the string.
:param fuzzy:
Whether to allow fuzzy parsing, allowing for string like "Today is
January 1, 2047 at 8:21:00AM".
Demo:
>>> parse('colasas|04/18/2017|NXP', fuzzy=True)
datetime.datetime(2017, 4, 18, 0, 0)
There is another one to also return tuples including the parts of the string that were ignored:
>>> parse('colasas|04/18/2017|NXP', fuzzy_with_tokens=True)
(datetime.datetime(2017, 4, 18, 0, 0), ('colasas|', '|NXP'))
This method won't work perfectly with all of your input strings, but it should get you most of the way there. You may have to do some pre-processing for the stranger ones.
Upvotes: 2