Reputation: 2795
I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.
Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.
E.g.:
dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
'2061-01-09', '2055-02-08'],
dtype='datetime64[ns]', freq=None)
The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?
Upvotes: 20
Views: 23698
Reputation: 11
If you run into the same problem with a pandas DataFrame, compare each parsed value against the current date (or against a cutoff year of your choosing) and move anything later back by a century, for example:
import datetime as dt
import pandas as pd
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x > dt.datetime.now() else x)
or
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x.year > 2022 else x)
Upvotes: 0
Reputation: 136
Another quick solution to the problem:
import pandas as pd
import numpy as np
dates = pd.DataFrame(['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55'])
for i in dates:
    tempyear = pd.to_numeric(dates[i].str[-2:])
    dates["temp_year"] = np.where((tempyear >= 44) & (tempyear <= 99), tempyear + 1900, tempyear + 2000).astype(str)
    dates["temp_month"] = dates[i].str[:-2]
    dates["temp_flyr"] = dates["temp_month"] + dates["temp_year"]
    dates["pddt"] = pd.to_datetime(dates.temp_flyr.str.upper(), format='%d-%b-%Y', yearfirst=False)
    tempdrops = ["temp_year", "temp_month", "temp_flyr", i]
    dates.drop(tempdrops, axis=1, inplace=True)
The output is shown below. The pddt column has been converted from object to pandas datetime format using pd.to_datetime.
pddt
0 2005-09-26
1 2005-09-26
2 1970-06-15
3 1994-12-05
4 1961-01-09
5 1955-02-08
As mentioned in some other answers, this works best if the dates from the two centuries do not overlap.
Upvotes: 1
Reputation: 41
You can write a simple function to correct the wrongly parsed year, as shown below:
import datetime
def fix_date(x):
    # Years after 1989 are treated as a century too late; adjust the cutoff to your data.
    if x.year > 1989:
        year = x.year - 100
    else:
        year = x.year
    return datetime.date(year, x.month, x.day)
df['date_column'] = df['date_column'].apply(fix_date)
Hope this helps.
Upvotes: 4
Reputation: 9103
For anyone looking for a quick and dirty code snippet to fix these cases, this worked for me:
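import pandas as pd
from datetime import timedelta
col = 'date'
df[col] = pd.to_datetime(df[col])
# Compare against a Timestamp: a plain datetime.date may not compare cleanly with datetime64 values.
future = df[col] > pd.Timestamp(year=2050, month=1, day=1)
df.loc[future, col] -= timedelta(days=365.25*100)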
You may need to tune the threshold date closer to the present depending on the earliest dates in your data.
Upvotes: 11
Reputation: 210832
From the Python docs (time module):
Year 2000 (Y2K) issues: Python depends on the platform’s C library, which generally doesn’t have year 2000 issues, since all dates and times are represented internally as seconds since the epoch. Function strptime() can parse 2-digit years when given %y format code. When 2-digit years are parsed, they are converted according to the POSIX and ISO C standards: values 69–99 are mapped to 1969–1999, and values 0–68 are mapped to 2000–2068.
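As a rough illustration of the rule quoted above, the mapping can be written out by hand and compared with what strptime returns for every two-digit year. The helper name posix_two_digit_year is made up for this sketch and is not part of any library.

```python
from datetime import datetime


def posix_two_digit_year(yy: int) -> int:
    """Hypothetical helper: expand a 2-digit year as the quoted rule describes."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    # 69-99 map to 1969-1999, 0-68 map to 2000-2068.
    return 1900 + yy if yy >= 69 else 2000 + yy


# Collect every two-digit year where strptime disagrees with the helper.
mismatches = [
    yy
    for yy in range(100)
    if datetime.strptime(f"{yy:02d}", "%y").year != posix_two_digit_year(yy)
]
print(mismatches)
```

If the list comes back empty, strptime on your platform follows the documented 68/69 cutoff.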
Upvotes: 12
Reputation: 55448
That seems to be the behavior of Python's datetime library. I ran a test to see where the cutoff is, and it falls between 68 and 69:
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)
Two digits year ambiguity
So it seems that any %y year below 69 is assigned to the 2000s, and anything from 69 upwards is assigned to the 1900s.
The %y two digits can only go from 00 to 99, which is going to be ambiguous if we start crossing centuries.
If there is no overlap, you could manually process it and annotate the century (kill the ambiguity)
I suggest you process your data manually and specify the century, e.g. you can decide that anything in your data that has the year between 17 and 68 is attributed to 1917 - 1968 (instead of 2017 - 2068).
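A minimal sketch of that manual approach, assuming the strings look like the ones in the question and that month abbreviations are in English (the %b directive is locale-dependent). The function name parse_with_pivot and the default pivot of 17 are illustrative choices, not an existing API.

```python
from datetime import date, datetime


def parse_with_pivot(text: str, pivot: int = 17) -> date:
    """Parse 'dd-Mon-yy'; two-digit years >= pivot go to 19xx, the rest to 20xx.

    The name and the default pivot are assumptions made for this sketch.
    """
    day_month, yy = text.rsplit("-", 1)
    century = 1900 if int(yy) >= pivot else 2000
    # Rebuild the string with a four-digit year so %Y removes the ambiguity.
    return datetime.strptime(f"{day_month}-{century + int(yy)}", "%d-%b-%Y").date()


dates = ['26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
parsed = [parse_with_pivot(d) for d in dates]
for original, fixed in zip(dates, parsed):
    print(original, fixed)
```

Picking the pivot is a judgment about your data: it should sit just above the latest two-digit year that can legitimately belong to the 2000s.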
If you have overlap, you can't parse the data without more year information, unless e.g. you have ordered data and a reference
If you have overlap, e.g. you have data from both 2016 and 1916 and both were logged as '16', that's ambiguous and there isn't sufficient information to parse it, unless the data is ordered by date, in which case you can use heuristics to switch the century as you parse it.
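One possible heuristic for the ordered case, sketched under strong assumptions: the records are sorted chronologically, consecutive records are less than 100 years apart, and the century of the first record is known. The function name parse_ordered and the sample log are invented for illustration.

```python
from datetime import date, datetime


def parse_ordered(texts, first_century=1900):
    """Parse sorted 'dd-Mon-yy' strings, rolling the century forward
    whenever a date would otherwise fall before its predecessor.

    Hypothetical helper; assumes ascending order and a known starting century.
    """
    century = first_century
    previous = None
    result = []
    for text in texts:
        day_month, yy = text.rsplit("-", 1)
        current = datetime.strptime(
            f"{day_month}-{century + int(yy)}", "%d-%b-%Y"
        ).date()
        if previous is not None and current < previous:
            # Going backwards in sorted data means we crossed a century.
            century += 100
            current = datetime.strptime(
                f"{day_month}-{century + int(yy)}", "%d-%b-%Y"
            ).date()
        result.append(current)
        previous = current
    return result


log = ['3-Mar-16', '12-Nov-68', '1-Jan-99', '7-Jul-05', '30-Aug-16']
parsed = parse_ordered(log)
print(parsed)
```

This breaks down if the data has gaps of a century or more, or if the ordering is not strictly reliable, so treat it as a starting point rather than a general solution.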
Upvotes: 23