toasted
toasted

Reputation: 7

Extracting dates from string and into a dataframe into a new column python

I am trying to extract dates from a set format of MONDAY, JUNE 14, 2021 from a sea of text in a data frame and create a new column for the extracted dates.

For example, the data frame looks like:

x title ii
xx aldkla MONDAY, JUNE 14, 2021 slkfdklkdj lkskjdfkdlskld
yy alksflksd TUESDAY, November, 23, 2020 alfjsldjf sfdadsf

And I am trying to create the end product to be:

x title ii Date
xx aldkla MONDAY, JUNE 14, 2021 slkfdklkdj lkskjdfkdlskld 2021-06-14
yy alksflksd TUESDAY, November, 23, 2020 alfjsldjf sfdadsf 2020-11-23

Currently, I have been using the library datefinder because I found it to be able to convert the str to date the best. However, once I start using it with a dataframe it does work. This is what I put.

df = dataframe #the example above 


Date = datefinder.find_dates(df.title)
for match in Date: 
  df["time"] = match

Is there a better method? Or am I going about this all wrong?

Upvotes: 1

Views: 1375

Answers (3)

Ferris
Ferris

Reputation: 5601

# method 1
obj = df['title']
Date = pd.to_datetime(obj.str.replace(r'^.*?(\w+)[\s,]+(\d+)[\s,]+(\d{4}).*?$', r'\1-\2-\3'), 
               format='%B-%d-%Y', 
               errors='ignore')

# method 2
import datefinder
Date = df['title'].map(datefinder.find_dates).map(list).str[0]

Upvotes: 0

FObersteiner
FObersteiner

Reputation: 25544

dateutil's parser does a good job with kwarg fussy=True:

import pandas as pd
from dateutil import parser

s = pd.Series(["aldkla MONDAY, JUNE 14, 2021 slkfdklkdj", "alksflksd TUESDAY, November, 23, 2020 alfjsldjf"])

s.apply(parser.parse, fuzzy=True)

>>>
0   2021-06-14
1   2020-11-23
dtype: datetime64[ns]

Upvotes: 2

Tim Biegeleisen
Tim Biegeleisen

Reputation: 520928

Using str.replace along with to_datetime:

df["Date"] = pd.to_datetime(df["title"].str.replace(r'\b([^\s,]+),? (\d+), (\d{4})\b', r'\1\2\3'), format='%B%d%Y', errors='ignore')

One slight potential problem with the above might occur if the days in your text dates are not zero left padded. In that case, more work would be needed to get those dates to parse properly.

Upvotes: 1

Related Questions