jeremy w
jeremy w

Reputation: 11

Pandas: Read dates in column which has different formats

I would like to read a csv, with dates in a column, but the dates are in different formats within the column.

Specifically, some dates are in "dd/mm/yyyy" format, and some are in "4####" format (excel 1900 date system, serial number represents days elapsed since 1900/1/1).

Is there any way to use read_csv or pandas.to_datetime to convert the column to datetime?

Have tried using pandas.to_datetime with no parameters to no avail.

df["Date"] = pd.to_datetime(df["Date"])

Returns

ValueError: year 42613 is out of range

Presumably it can read the "dd/mm/yyyy" format fine but produces an error for the "4####" format.

Note: the column is mixed type as well

Appreciate any help

Example

dates = ['25/07/2016', '42315'] 
df = DataFrame (dates, columns=['Date'])
#desired output ['25/07/2016', '07/11/2015']

Upvotes: 1

Views: 975

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35636

Let's try:

dates = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
m = dates.isna()
dates.loc[m] = (
        pd.TimedeltaIndex(df.loc[m, 'Date'].astype(int), unit='d')
        + pd.Timestamp(year=1899, month=12, day=30)
)
df['Date'] = dates

Or alternatively with seconds conversion:

dates = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
m = dates.isna()
dates.loc[m] = pd.to_datetime(
    (df.loc[m, 'Date'].astype(int) - 25569) * 86400.0,
    unit='s'
)
df['Date'] = dates

df:

        Date
0 2016-07-25
1 2015-11-07

Explanation:

  1. First convert to datetime all that can be done with pd.to_datetime:
dates = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
  1. Check which values couldn't be converted:
m = dates.isna()
  1. Convert NaTs

a. Offset as days since 1899-12-30 using TimedeltaIndex + pd.Timestamp:

dates.loc[m] = (
        pd.TimedeltaIndex(df.loc[m, 'Date'].astype(int), unit='d')
        + pd.Timestamp(year=1899, month=12, day=30)
)

b. Or convert serial days to seconds mathematically:

dates.loc[m] = pd.to_datetime(
    (df.loc[m, 'Date'].astype(int) - 25569) * 86400.0,
    unit='s'
)
  1. Update the Date column:
df['Date'] = dates

Upvotes: 1

Related Questions