Reputation: 1079
I am trying to convert a messy notebook with dates into a sorted date series in pandas.
0 03/25/93 Total time of visit (in minutes):\n
1 6/18/85 Primary Care Doctor:\n
2 sshe plans to move as of 7/8/71 In-Home Servic...
3 7 on 9/27/75 Audit C Score Current:\n
4 2/6/96 sleep studyPain Treatment Pain Level (N...
5 .Per 7/06/79 Movement D/O note:\n
6 4, 5/18/78 Patient's thoughts about current su...
7 10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8 3/7/86 SOS-10 Total Score:\n
9 (4/10/71)Score-1Audit C Score Current:\n
10 (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11 4/09/75 SOS-10 Total Score:\n
12 8/01/98 Communication with referring physician...
13 1/26/72 Communication with referring physician...
14 5/24/1990 CPT Code: 90792: With medical servic...
15 1/25/2011 CPT Code: 90792: With medical servic...
I have multiple dates formats such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. And I would like to convert all these into mm/dd/yyyy to a new column.
So far I have done
df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')
Also, I do not how to extract all lines with only mm/yy or yyyy format date without interfering with the code above. Bear in mind that with the absence of day or month I would consider 1st and January as default values.
Upvotes: 1
Views: 1131
Reputation: 402603
You can use pd.Series.str.extract
with a regex, and then apply pd.to_datetime
:
pd.to_datetime(df['Text'].str.extract(
r'(?P<Date>\d+(?:\/\d+){2})', expand=False), errors='coerce')
0 1993-03-25
1 1985-06-18
2 1971-07-08
3 1975-09-27
4 1996-02-06
5 1979-07-06
6 1978-05-18
7 1989-10-24
8 1986-03-07
9 1971-04-10
10 1985-05-11
11 1975-04-09
12 1998-08-01
13 1972-01-26
14 1990-05-24
15 2011-01-25
Name: Date, dtype: datetime64[ns]
str.extract
returns a series of strings that look like this:
array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79',
'5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75',
'8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)
Regex Details
(?P<Date>\d+(?:\/\d+){2})
(?P<Date>....)
- named capturing group\d+
1 or more digits(?:\/\d+){2}
- non-capturing group repeating twice, where
\/
- escaped forward slash{2}
- repeater (two times)Regex for Missing Days
To handle optional days
, a slightly modified regex is required:
(?P<Date>(?:\d+\/)?\d+/\d+)
Details
(?P<Date>....)
- named capturing group(?:\d+\/)?
- nested group (non-capturing) where \d+\/
is optional.\d+
1 or more digits\/
escaped forward slashThe rest is the same. Substitute this regex in place of the current one. pd.to_datetime
will handle missing days.
Upvotes: 1