Reputation: 327
I am using pandas 0.17.0 and have a df
similar to this one:
df.head()
Out[339]:
A B C
DATE_TIME
2016-10-08 13:57:00 in 5.61 1
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:17:00 out 8.18 0
df.tail()
Out[340]:
A B C
DATE_TIME
2016-11-08 13:42:00 in 8.00 0
2016-11-08 13:47:00 in 7.99 0
2016-11-08 13:52:00 out 7.97 0
2016-11-08 13:57:00 in 8.14 1
2016-11-08 14:02:00 in 8.16 1
with following dtypes
:
print (df.dtypes)
A object
B float64
C int64
dtype: object
When I reindex my df
to minute intervals all the columns int64
change to float64
.
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index)
print (df2.dtypes)
A object
B float64
C float64
dtype: object
Also, if I try to resample
df3 = df.resample('Min')
The int64
will turn into a float64
and for some reason I loose my object
column.
print (df3.dtypes)
print (df3.dtypes)
B float64
C float64
dtype: object
Since I want to interpolate the columns differently based on this distinction in an subsequent step (after concatenating the df
with another df
), I need them to maintain their original dtype
. My real df
has far more columns of each type, for which reason I am looking for a solution that does not depend on calling the columns individually by their label.
Is there a way to maintain their dtype
throughout the reindexing? Or is there a way how I can assign them their dtype
afterwards (they are the only columns consisiting only of integers besides NANs)?
Can anybody help me?
Upvotes: 16
Views: 2787
Reputation: 583
It might become possible out of the box in the future. Nullable integer series are an experimental feature at the time of writing. As stated in the documentation:
IntegerArray is currently experimental. Its API or implementation may change without warning.
For production code, I suggest following jezrael's answer, namely casting types after reindexing.
import pandas as pd
# OP's data.
df = pd.DataFrame(
data={
'A': ['in', 'in', 'in', 'in', 'out'],
'B': [5.61, 8.05, 7.92, 7.98, 8.18],
'C': [1, 1, 0, 0, 0],
},
index=pd.to_datetime(['2016-10-08 13:57:00', '2016-10-08 14:02:00', '2016-10-08 14:07:00', '2016-10-08 14:12:00', '2016-10-08 14:17:00'])
)
# Cast integer column into nullable integer type.
df['C'] = df['C'].astype('Int64') # Note the capital I.
# Reindex and check.
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index)
print(df2.dtypes)
yields
A object
B float64
C Int64
dtype: object
Upvotes: 0
Reputation: 863671
It is impossible, because if you get at least one NaN
value in some column, int
is converted to float
.
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index)
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1.0
2016-10-08 13:58:00 NaN NaN NaN
2016-10-08 13:59:00 NaN NaN NaN
2016-10-08 14:00:00 NaN NaN NaN
2016-10-08 14:01:00 NaN NaN NaN
2016-10-08 14:02:00 in 8.05 1.0
2016-10-08 14:03:00 NaN NaN NaN
2016-10-08 14:04:00 NaN NaN NaN
2016-10-08 14:05:00 NaN NaN NaN
2016-10-08 14:06:00 NaN NaN NaN
2016-10-08 14:07:00 in 7.92 0.0
2016-10-08 14:08:00 NaN NaN NaN
2016-10-08 14:09:00 NaN NaN NaN
2016-10-08 14:10:00 NaN NaN NaN
2016-10-08 14:11:00 NaN NaN NaN
2016-10-08 14:12:00 in 7.98 0.0
2016-10-08 14:13:00 NaN NaN NaN
2016-10-08 14:14:00 NaN NaN NaN
2016-10-08 14:15:00 NaN NaN NaN
2016-10-08 14:16:00 NaN NaN NaN
2016-10-08 14:17:00 out 8.18 0.0
print (df2.dtypes)
A object
B float64
C float64
dtype: object
But if you use parameter fill_value
in reindex
, dtypes
are not changed:
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index, fill_value=0)
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1
2016-10-08 13:58:00 0 0.00 0
2016-10-08 13:59:00 0 0.00 0
2016-10-08 14:00:00 0 0.00 0
2016-10-08 14:01:00 0 0.00 0
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:03:00 0 0.00 0
2016-10-08 14:04:00 0 0.00 0
2016-10-08 14:05:00 0 0.00 0
2016-10-08 14:06:00 0 0.00 0
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:08:00 0 0.00 0
2016-10-08 14:09:00 0 0.00 0
2016-10-08 14:10:00 0 0.00 0
2016-10-08 14:11:00 0 0.00 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:13:00 0 0.00 0
2016-10-08 14:14:00 0 0.00 0
2016-10-08 14:15:00 0 0.00 0
2016-10-08 14:16:00 0 0.00 0
2016-10-08 14:17:00 out 8.18 0
print (df2.dtypes)
A object
B float64
C int64
dtype: object
Better is to use method='ffill
in reindex
:
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index, method='ffill')
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1
2016-10-08 13:58:00 in 5.61 1
2016-10-08 13:59:00 in 5.61 1
2016-10-08 14:00:00 in 5.61 1
2016-10-08 14:01:00 in 5.61 1
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:03:00 in 8.05 1
2016-10-08 14:04:00 in 8.05 1
2016-10-08 14:05:00 in 8.05 1
2016-10-08 14:06:00 in 8.05 1
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:08:00 in 7.92 0
2016-10-08 14:09:00 in 7.92 0
2016-10-08 14:10:00 in 7.92 0
2016-10-08 14:11:00 in 7.92 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:13:00 in 7.98 0
2016-10-08 14:14:00 in 7.98 0
2016-10-08 14:15:00 in 7.98 0
2016-10-08 14:16:00 in 7.98 0
2016-10-08 14:17:00 out 8.18 0
print (df2.dtypes)
A object
B float64
C int64
dtype: object
If you use resample
, you can get column A
back by unstack
and stack
, but unfortunately there is still a problem with float
:
df3 = df.set_index('A', append=True)
.unstack()
.resample('Min', fill_method='ffill')
.stack()
.reset_index(level=1)
print (df3)
A B C
DATE_TIME
2016-10-08 13:57:00 in 5.61 1.0
2016-10-08 13:58:00 in 5.61 1.0
2016-10-08 13:59:00 in 5.61 1.0
2016-10-08 14:00:00 in 5.61 1.0
2016-10-08 14:01:00 in 5.61 1.0
2016-10-08 14:02:00 in 8.05 1.0
2016-10-08 14:03:00 in 8.05 1.0
2016-10-08 14:04:00 in 8.05 1.0
2016-10-08 14:05:00 in 8.05 1.0
2016-10-08 14:06:00 in 8.05 1.0
2016-10-08 14:07:00 in 7.92 0.0
2016-10-08 14:08:00 in 7.92 0.0
2016-10-08 14:09:00 in 7.92 0.0
2016-10-08 14:10:00 in 7.92 0.0
2016-10-08 14:11:00 in 7.92 0.0
2016-10-08 14:12:00 in 7.98 0.0
2016-10-08 14:13:00 in 7.98 0.0
2016-10-08 14:14:00 in 7.98 0.0
2016-10-08 14:15:00 in 7.98 0.0
2016-10-08 14:16:00 in 7.98 0.0
2016-10-08 14:17:00 out 8.18 0.0
print (df3.dtypes)
A object
B float64
C float64
dtype: object
I modified a previous answer for casting to `int:
int_cols = df.select_dtypes(['int64']).columns
print (int_cols)
Index(['C'], dtype='object')
index = pd.date_range(df.index[0], df.index[-1], freq="s")
df2 = df.reindex(index)
for col in df2:
if col == int_cols:
df2[col].ffill(inplace=True)
df2[col] = df2[col].astype(int)
elif df2[col].dtype == float:
df2[col].interpolate(inplace=True)
else:
df2[col].ffill(inplace=True)
#print (df2)
print (df2.dtypes)
A object
B float64
C int32
dtype: object
Upvotes: 14