Extracting and parsing dates in a pandas dataframe

Question

I am trying to convert a messy notebook with dates into a sorted date series in pandas.

0           03/25/93 Total time of visit (in minutes):

1                         6/18/85 Primary Care Doctor:

2      sshe plans to move as of 7/8/71 In-Home Servic...
3                  7 on 9/27/75 Audit C Score Current:

4      2/6/96 sleep studyPain Treatment Pain Level (N...
5                      .Per 7/06/79 Movement D/O note:

6      4, 5/18/78 Patient's thoughts about current su...
7      10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                           3/7/86 SOS-10 Total Score:

9               (4/10/71)Score-1Audit C Score Current:

10     (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11                         4/09/75 SOS-10 Total Score:

12     8/01/98 Communication with referring physician...
13     1/26/72 Communication with referring physician...
14     5/24/1990 CPT Code: 90792: With medical servic...
15     1/25/2011 CPT Code: 90792: With medical servic...

I have multiple dates formats such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. And I would like to convert all these into mm/dd/yyyy to a new column.

So far I have done

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')

Also, I do not how to extract all lines with only mm/yy or yyyy format date without interfering with the code above. Bear in mind that with the absence of day or month I would consider 1st and January as default values.

cs95 · Accepted Answer

You can use pd.Series.str.extract with a regex, and then apply pd.to_datetime:

pd.to_datetime(df['Text'].str.extract(
    r'(?P\d+(?:\/\d+){2})', expand=False), errors='coerce')

0    1993-03-25
1    1985-06-18
2    1971-07-08
3    1975-09-27
4    1996-02-06
5    1979-07-06
6    1978-05-18
7    1989-10-24
8    1986-03-07
9    1971-04-10
10   1985-05-11
11   1975-04-09
12   1998-08-01
13   1972-01-26
14   1990-05-24
15   2011-01-25
Name: Date, dtype: datetime64[ns]

str.extract returns a series of strings that look like this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79',
       '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75',
       '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)

Regex Details

(?P\d+(?:\/\d+){2})

(?P....) - named capturing group
\d+ 1 or more digits
(?:\/\d+){2} - non-capturing group repeating twice, where
- \/ - escaped forward slash
- {2} - repeater (two times)

Regex for Missing Days

To handle optional days, a slightly modified regex is required:

(?P(?:\d+\/)?\d+/\d+)

Details

(?P....) - named capturing group
(?:\d+\/)? - nested group (non-capturing) where \d+\/ is optional.

\d+ 1 or more digits
\/ escaped forward slash

The rest is the same. Substitute this regex in place of the current one. pd.to_datetime will handle missing days.

Extracting and parsing dates in a pandas dataframe

Answers (1)

Related Questions