JPV
JPV

Reputation: 1079

Extracting and parsing dates in a pandas dataframe

I am trying to convert a messy notebook with dates into a sorted date series in pandas.

0           03/25/93 Total time of visit (in minutes):\n
1                         6/18/85 Primary Care Doctor:\n
2      sshe plans to move as of 7/8/71 In-Home Servic...
3                  7 on 9/27/75 Audit C Score Current:\n
4      2/6/96 sleep studyPain Treatment Pain Level (N...
5                      .Per 7/06/79 Movement D/O note:\n
6      4, 5/18/78 Patient's thoughts about current su...
7      10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                           3/7/86 SOS-10 Total Score:\n
9               (4/10/71)Score-1Audit C Score Current:\n
10     (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11                         4/09/75 SOS-10 Total Score:\n
12     8/01/98 Communication with referring physician...
13     1/26/72 Communication with referring physician...
14     5/24/1990 CPT Code: 90792: With medical servic...
15     1/25/2011 CPT Code: 90792: With medical servic...

I have multiple dates formats such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. And I would like to convert all these into mm/dd/yyyy to a new column.

So far I have done

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')

Also, I do not how to extract all lines with only mm/yy or yyyy format date without interfering with the code above. Bear in mind that with the absence of day or month I would consider 1st and January as default values.

Upvotes: 1

Views: 1131

Answers (1)

cs95
cs95

Reputation: 402603

You can use pd.Series.str.extract with a regex, and then apply pd.to_datetime:

pd.to_datetime(df['Text'].str.extract(
    r'(?P<Date>\d+(?:\/\d+){2})', expand=False), errors='coerce')

0    1993-03-25
1    1985-06-18
2    1971-07-08
3    1975-09-27
4    1996-02-06
5    1979-07-06
6    1978-05-18
7    1989-10-24
8    1986-03-07
9    1971-04-10
10   1985-05-11
11   1975-04-09
12   1998-08-01
13   1972-01-26
14   1990-05-24
15   2011-01-25
Name: Date, dtype: datetime64[ns]

str.extract returns a series of strings that look like this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79',
       '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75',
       '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)

Regex Details

(?P<Date>\d+(?:\/\d+){2})
  • (?P<Date>....) - named capturing group
  • \d+ 1 or more digits
  • (?:\/\d+){2} - non-capturing group repeating twice, where
    • \/ - escaped forward slash
    • {2} - repeater (two times)

Regex for Missing Days

To handle optional days, a slightly modified regex is required:

(?P<Date>(?:\d+\/)?\d+/\d+)

Details

  • (?P<Date>....) - named capturing group
  • (?:\d+\/)? - nested group (non-capturing) where \d+\/ is optional.
  • \d+ 1 or more digits
  • \/ escaped forward slash

The rest is the same. Substitute this regex in place of the current one. pd.to_datetime will handle missing days.

Upvotes: 1

Related Questions