goidelg
goidelg

Reputation: 326

Converting pandas column to date, with many type of dates

How would you convert the below date column into a single formatted date column?

df = pd.DataFrame(data={'datecol': ["-",
                                    "44198",
                                    "2021/01/01", 
                                    "14.04.20",
                                    "2021-13-03"]})

print(df.dropna()) should return the result below:

      datecol
0  2021-01-02
1  2021-01-01
2  2020-04-14
3  2021-03-13

Upvotes: 1

Views: 357

Answers (2)

not_speshal
not_speshal

Reputation: 23146

  1. Convert all valid datetime formats using pd.to_datetime, specifying formats for unrecognised formats
  2. Convert all integer (Excel) dates.
  3. Combine both with fillna
parsed = pd.to_datetime(df["datecol"], errors="coerce").fillna(pd.to_datetime(df["datecol"],format="%Y-%d-%m",errors="coerce"))
ordinal = pd.to_numeric(df["datecol"], errors="coerce").apply(lambda x: pd.Timestamp("1899-12-30")+pd.Timedelta(x, unit="D"))

df["datecol"] = parsed.fillna(ordinal)

>>> df
     datecol
0        NaT
1 2021-01-02
2 2021-01-01
3 2020-04-14
4 2021-03-13

Upvotes: 3

ALollz
ALollz

Reputation: 59549

If a column contains multiple formats, you're going to need to parse the column multiple times with the different formats and use combine_first to combine the resulting information. Because we specify errors='coerce' the date format should only match one of the formats.

The other small complication is that some of your formats require you to just specify the format argument, but others would require the origin and unit parameters. We can take care of this passing a dict of kwargs to the pd.to_datetime function.

Note any numeric values will work with origin and unit so you can't use this method if your date column had values that represented different units with different offsets in the same column. You would need to provide other logic to indicate which units and offsets are pertinent to which rows in that case.

import pandas as pd
from functools import reduce 

kwl = [{'format': '%Y/%m/%d'},
       {'format': '%d.%m.%y'},
       {'format': '%Y-%d-%m'},
       {'format': '%Y/%m/%d'},
       {'unit': 'd', 'origin': '1899-12-30'}]]

l = []
for kwargs in kwl:
    if 'unit' in kwargs.keys():
        s = pd.to_numeric(df['datecol'], errors='coerce')
    else:
        s = df['datecol']
        
    l.append(pd.to_datetime(s, errors='coerce', **kwargs))

result = reduce(lambda l,r: l.combine_first(r), l)

print(result)
#0          NaT
#1   2021-01-02
#2   2021-01-01
#3   2020-04-14
#4   2021-03-13
Name: datecol, dtype: datetime64[ns]

Upvotes: 2

Related Questions