cardamom
cardamom

Reputation: 7421

Pandas 'astype' with date (or datetime)

This answer contains a very elegant way of setting all the types of your pandas columns in one line:

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

I am starting to think that that unfortunately has limited application and you will have to use various other methods of casting the column types sooner or later, over many lines. I tested 'category' and that worked, so it will take things which are actual python types like int or complex and then pandas terms in quotation marks like 'category'.

I have a column of dates which looks like this:

25.07.10
08.08.10
07.01.11

I had a look at this answer about casting date columns but none of them seem to fit into the elegant syntax above.

I tried:

from datetime import date
df = df.astype({"date": date})

but it gave an error:

TypeError: dtype '<class 'datetime.date'>' not understood

I also tried pd.Series.dt.date which also didn't work. Is it possible to cast all your columns including the date or datetime column in one line like this?

Upvotes: 17

Views: 83291

Answers (2)

cottontail
cottontail

Reputation: 23081

datetime

Since you can't pass datetime format to astype(), it's a little primitive and it's better to use pd.to_datetime() instead.

df['date'] = pd.to_datetime(df['date'])

For example, if the dates in the data are of the format %d/%m/%Y such as 01/04/2020, astype() would incorrectly parse it as 2020-01-04 whereas with pd.to_datetime(), you can pass the correct format.

If you need to convert multiple columns into datetime64 (which is often the reason astype() is used), then you can apply pd.to_datetime().

df = pd.DataFrame({'date1': ['01/04/2020'], 'date2': ['02/04/2020']})
df = df.apply(pd.to_datetime, format='%d/%m/%Y')

Even with read_csv, you have some control over the format, e.g.

df = pd.read_csv('file.csv', parse_dates=['date'], dayfirst=True)

date

If you want to cast into date, then you can first cast to datetime64[ns] and then use dt.date to get a column of datetime.date objects:

df['date'] = pd.to_datetime(df['date']).dt.date

The column dtype will become object though (on which you can still perform vectorized operations such as adding days, comparing dates etc.), so if you plan to work on it a lot in pandas, it's more performative to use datetime64 instead. For example, adding a day is extremely fast on datetime64 columns, not so much on date columns:

s_dt = pd.Series(pd.date_range('1700', None, 10000, 'D'))
s_d = s_dt.dt.date

%timeit x = s_dt + pd.Timedelta(days=1)
# 344 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit y = s_d + pd.Timedelta(days=1)
# 56.1 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With that being said, if you dump it into a database (such as sqlite), an object dtype column of datetime.date objects stored as a DATE type (whereas datetime64[ns] will be stored as TIMESTAMP).


Pandas datetime dtype is from numpy datetime64, so if you have pandas<2.0, you can use the following as well (since pandas 2.0, unitless datetime64 is not supported anymore). There's no date dtype (although you can perform vectorized operations on a column that holds datetime.date values).

df = df.astype({'date': np.datetime64})

# or (on a little endian system)
df = df.astype({'date': '<M8'})
# (on a big endian system)
df = df.astype({'date': '>M8'})

Upvotes: 8

joelostblom
joelostblom

Reputation: 48909

This has been answered in the comments where it was noted that the following works:

df.astype({'date': 'datetime64[ns]'})

In addition, you can set the dtype when reading in the data:

pd.read_csv('path/to/file.csv', parse_dates=['date'])

Upvotes: 33

Related Questions