Reputation: 1323
I have a pandas DataFrame
of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am interested in only the year, month and day in the birth
column of the dataframe
. I tried to leverage on the Python datetime
from pandas
but it resulted into an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth
column is an object
dtype.
My guess would be that it is an incorrect date. I would not like to pass the parameter errors="coerce"
into the to_datetime
method, because each item is important and I need just the YYYY-MM-DD
.
I tried to leverage on the regex
from pandas
:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this is returning NANs
. How can I resolve this?
Thanks
Upvotes: 1
Views: 722
Reputation: 862581
Because not possible convert to datetimes you can use split
by first whitespace and then select first value:
df['birth'] = df['birth'].str.split().str[0]
And then if necessary convert to periods.
Representing out-of-bounds spans.
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
splitted = x.split('-')
return pd.Period(year=int(splitted[0]),
month=int(splitted[1]),
day=int(splitted[2]), freq='D')
df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
Upvotes: 1