Reputation: 7
I am trying to extract dates from a set format of MONDAY, JUNE 14, 2021 from a sea of text in a data frame and create a new column for the extracted dates.
For example, the data frame looks like:
x | title | ii |
---|---|---|
xx | aldkla MONDAY, JUNE 14, 2021 slkfdklkdj | lkskjdfkdlskld |
yy | alksflksd TUESDAY, November, 23, 2020 alfjsldjf | sfdadsf |
And I am trying to create the end product to be:
x | title | ii | Date |
---|---|---|---|
xx | aldkla MONDAY, JUNE 14, 2021 slkfdklkdj | lkskjdfkdlskld | 2021-06-14 |
yy | alksflksd TUESDAY, November, 23, 2020 alfjsldjf | sfdadsf | 2020-11-23 |
Currently, I have been using the library datefinder because I found it to be able to convert the str to date the best. However, once I start using it with a dataframe it does work. This is what I put.
df = dataframe #the example above
Date = datefinder.find_dates(df.title)
for match in Date:
df["time"] = match
Is there a better method? Or am I going about this all wrong?
Upvotes: 1
Views: 1375
Reputation: 5601
# method 1
obj = df['title']
Date = pd.to_datetime(obj.str.replace(r'^.*?(\w+)[\s,]+(\d+)[\s,]+(\d{4}).*?$', r'\1-\2-\3'),
format='%B-%d-%Y',
errors='ignore')
# method 2
import datefinder
Date = df['title'].map(datefinder.find_dates).map(list).str[0]
Upvotes: 0
Reputation: 25544
dateutil's parser does a good job with kwarg fussy=True
:
import pandas as pd
from dateutil import parser
s = pd.Series(["aldkla MONDAY, JUNE 14, 2021 slkfdklkdj", "alksflksd TUESDAY, November, 23, 2020 alfjsldjf"])
s.apply(parser.parse, fuzzy=True)
>>>
0 2021-06-14
1 2020-11-23
dtype: datetime64[ns]
Upvotes: 2
Reputation: 520928
Using str.replace
along with to_datetime
:
df["Date"] = pd.to_datetime(df["title"].str.replace(r'\b([^\s,]+),? (\d+), (\d{4})\b', r'\1\2\3'), format='%B%d%Y', errors='ignore')
One slight potential problem with the above might occur if the days in your text dates are not zero left padded. In that case, more work would be needed to get those dates to parse properly.
Upvotes: 1