Reputation: 25555
Either I don't understand the documentation or it is outdated.
If I run
user[["DOC_ACC_DT", "USER_SIGNON_ID"]].groupby("DOC_ACC_DT").agg(["count"]).resample("1D").fillna(value=0, method="ffill")
I get
TypeError: fillna() got an unexpected keyword argument 'value'
If I just run
.fillna(0)
I get
ValueError: Invalid fill method. Expecting pad (ffill), backfill (bfill) or nearest. Got 0
If I then set
.fillna(0, method="ffill")
I get
TypeError: fillna() got multiple values for keyword argument 'method'
so the only thing that works is
.fillna("ffill")
but of course that just does a forward fill. However, I want to replace NaN
with zeros. What am I doing wrong here?
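For reference, here is a self-contained stand-in for my frame (column names as in my data, the values are made up) that reproduces the problem:

```python
import pandas as pd

# made-up stand-in for my `user` frame
user = pd.DataFrame({
    "DOC_ACC_DT": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-04"]),
    "USER_SIGNON_ID": ["u1", "u2", "u3"],
})

counts = user[["DOC_ACC_DT", "USER_SIGNON_ID"]].groupby("DOC_ACC_DT").agg(["count"])
r = counts.resample("1D")
# r.fillna(value=0, method="ffill")  # TypeError: unexpected keyword argument 'value'
```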
Upvotes: 17
Views: 23041
Reputation: 920
You can simply use sum()
for it.
Check out the docs at https://pandas.pydata.org/docs/reference/api/pandas.core.resample.Resampler.sum.html
Basically there's a min_count
parameter, 0 by default: if a resampled bin contains fewer than min_count
valid (non-NaN) values, its sum is NaN. With the default of 0, an empty bin simply sums to 0, so there is no need to replace or fillna.
In fact, if you want to fill with a value other than 0, you can set .sum(min_count=1)
so that empty bins stay NaN, and then .replace(float('nan'), 'whatever you want')
Example below:
>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-01-07', freq='3D'),
...                    'value': range(3)})
>>> df
date value
0 2021-01-01 0
1 2021-01-04 1
2 2021-01-07 2
>>> df.set_index('date').resample('D').sum().reset_index()
date value
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 1
4 2021-01-05 0
5 2021-01-06 0
6 2021-01-07 2
# if you wanna replace nan with some other values, could also use replace() if more than
# 1 column to replace
>>> df.set_index('date').resample('D').sum(min_count=1).fillna(-10).reset_index()
date value
0 2021-01-01 0.0
1 2021-01-02 -10.0
2 2021-01-03 -10.0
3 2021-01-04 1.0
4 2021-01-05 -10.0
5 2021-01-06 -10.0
6 2021-01-07 2.0
Upvotes: 3
Reputation: 2052
The issue here is that you are trying to call the fillna
method on the DatetimeIndexResampler
object returned by the resample
method. If you call an aggregation function before fillna, it will work, for example: df.resample('1H').sum().fillna(0)
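A minimal sketch of the difference, on a made-up series: calling fillna on the resampler fails, but after an aggregation it is a plain Series.fillna again.

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=pd.to_datetime(["2021-01-01", "2021-01-03"]))

# s.resample("1D").fillna(0)  # fails: the resampler's fillna only takes a fill method
out = s.resample("1D").mean().fillna(0)  # aggregate first, then fill normally
print(out)
```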
Upvotes: 2
Reputation: 863721
I did some tests and the results are very interesting.
Sample:
import pandas as pd
import numpy as np
np.random.seed(1)
rng = pd.date_range('1/1/2012', periods=20, freq='S')
df = pd.DataFrame({'a': ['a'] * 10 + ['b'] * 10,
                   'b': np.random.randint(0, 500, len(rng))}, index=rng)
df.b.iloc[3:8] = np.nan
print (df)
a b
2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:01 a 235.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:03 a NaN
2012-01-01 00:00:04 a NaN
2012-01-01 00:00:05 a NaN
2012-01-01 00:00:06 a NaN
2012-01-01 00:00:07 a NaN
2012-01-01 00:00:08 a 335.0
2012-01-01 00:00:09 a 448.0
2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:11 b 129.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:13 b 71.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:15 b 390.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:17 b 178.0
2012-01-01 00:00:18 b 276.0
2012-01-01 00:00:19 b 254.0
Downsampling:
A possible solution is Resampler.asfreq
. If you use asfreq
, the behaviour is the same as aggregating with first
:
print (df.groupby('a').resample('2S').first())
a b
a
a 2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:04 a NaN
2012-01-01 00:00:06 a NaN
2012-01-01 00:00:08 a 335.0
b 2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:18 b 276.0
print (df.groupby('a').resample('2S').first().fillna(0))
a b
a
a 2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:04 a 0.0
2012-01-01 00:00:06 a 0.0
2012-01-01 00:00:08 a 335.0
b 2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:18 b 276.0
print (df.groupby('a').resample('2S').asfreq().fillna(0))
a b
a
a 2012-01-01 00:00:00 a 37.0
2012-01-01 00:00:02 a 396.0
2012-01-01 00:00:04 a 0.0
2012-01-01 00:00:06 a 0.0
2012-01-01 00:00:08 a 335.0
b 2012-01-01 00:00:10 b 144.0
2012-01-01 00:00:12 b 460.0
2012-01-01 00:00:14 b 237.0
2012-01-01 00:00:16 b 281.0
2012-01-01 00:00:18 b 276.0
If you use replace
, the values are aggregated as with mean
:
print (df.groupby('a').resample('2S').mean())
b
a
a 2012-01-01 00:00:00 136.0
2012-01-01 00:00:02 396.0
2012-01-01 00:00:04 NaN
2012-01-01 00:00:06 NaN
2012-01-01 00:00:08 391.5
b 2012-01-01 00:00:10 136.5
2012-01-01 00:00:12 265.5
2012-01-01 00:00:14 313.5
2012-01-01 00:00:16 229.5
2012-01-01 00:00:18 265.0
print (df.groupby('a').resample('2S').mean().fillna(0))
b
a
a 2012-01-01 00:00:00 136.0
2012-01-01 00:00:02 396.0
2012-01-01 00:00:04 0.0
2012-01-01 00:00:06 0.0
2012-01-01 00:00:08 391.5
b 2012-01-01 00:00:10 136.5
2012-01-01 00:00:12 265.5
2012-01-01 00:00:14 313.5
2012-01-01 00:00:16 229.5
2012-01-01 00:00:18 265.0
print (df.groupby('a').resample('2S').replace(np.nan,0))
b
a
a 2012-01-01 00:00:00 136.0
2012-01-01 00:00:02 396.0
2012-01-01 00:00:04 0.0
2012-01-01 00:00:06 0.0
2012-01-01 00:00:08 391.5
b 2012-01-01 00:00:10 136.5
2012-01-01 00:00:12 265.5
2012-01-01 00:00:14 313.5
2012-01-01 00:00:16 229.5
2012-01-01 00:00:18 265.0
Upsampling:
Use asfreq
; it gives the same result as replace
:
print (df.groupby('a').resample('200L').asfreq().fillna(0))
a b
a
a 2012-01-01 00:00:00.000 a 37.0
2012-01-01 00:00:00.200 0 0.0
2012-01-01 00:00:00.400 0 0.0
2012-01-01 00:00:00.600 0 0.0
2012-01-01 00:00:00.800 0 0.0
2012-01-01 00:00:01.000 a 235.0
2012-01-01 00:00:01.200 0 0.0
2012-01-01 00:00:01.400 0 0.0
2012-01-01 00:00:01.600 0 0.0
2012-01-01 00:00:01.800 0 0.0
2012-01-01 00:00:02.000 a 396.0
2012-01-01 00:00:02.200 0 0.0
2012-01-01 00:00:02.400 0 0.0
...
print (df.groupby('a').resample('200L').replace(np.nan,0))
b
a
a 2012-01-01 00:00:00.000 37.0
2012-01-01 00:00:00.200 0.0
2012-01-01 00:00:00.400 0.0
2012-01-01 00:00:00.600 0.0
2012-01-01 00:00:00.800 0.0
2012-01-01 00:00:01.000 235.0
2012-01-01 00:00:01.200 0.0
2012-01-01 00:00:01.400 0.0
2012-01-01 00:00:01.600 0.0
2012-01-01 00:00:01.800 0.0
2012-01-01 00:00:02.000 396.0
2012-01-01 00:00:02.200 0.0
2012-01-01 00:00:02.400 0.0
...
print ((df.groupby('a').resample('200L').replace(np.nan,0).b ==
df.groupby('a').resample('200L').asfreq().fillna(0).b).all())
True
Conclusion:
For downsampling, use the appropriate aggregating function such as sum
, first
or mean
; for upsampling, use asfreq
.
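That conclusion in runnable form, on a tiny made-up series (without the groupby, which works the same way):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.date_range("2021-01-01", periods=4, freq="s"))

# downsampling: pick an aggregation, then fillna on the aggregated result
down = s.resample("2s").mean().fillna(0)

# upsampling: asfreq inserts the new slots as NaN, then fillna replaces them
up = s.resample("500ms").asfreq().fillna(0)
```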
Upvotes: 8
Reputation: 29719
The only workaround close to using fillna
directly would be to call it after performing .head(len(df.index))
.
I presume DF.head
is useful here mainly because when resample is applied to a groupby object, it acts as a filter on the input, returning a reduced shape of the original due to the elimination of groups.
Calling DF.head()
is not affected by this transformation and returns the entire DF
.
Demo:
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 2),
                  index=pd.date_range('1/1/2016', freq='10D', periods=10),
                  columns=['A', 'B']).reset_index()
df
index A B
0 2016-01-01 0.496714 -0.138264
1 2016-01-11 0.647689 1.523030
2 2016-01-21 -0.234153 -0.234137
3 2016-01-31 1.579213 0.767435
4 2016-02-10 -0.469474 0.542560
5 2016-02-20 -0.463418 -0.465730
6 2016-03-01 0.241962 -1.913280
7 2016-03-11 -1.724918 -0.562288
8 2016-03-21 -1.012831 0.314247
9 2016-03-31 -0.908024 -1.412304
Operations:
resampled_group = df[['index', 'A']].groupby(['index'])['A'].agg('count').resample('2D')
resampled_group.head(len(resampled_group.index)).fillna(0).head(20)
index
2016-01-01 1.0
2016-01-03 0.0
2016-01-05 0.0
2016-01-07 0.0
2016-01-09 0.0
2016-01-11 1.0
2016-01-13 0.0
2016-01-15 0.0
2016-01-17 0.0
2016-01-19 0.0
2016-01-21 1.0
2016-01-23 0.0
2016-01-25 0.0
2016-01-27 0.0
2016-01-29 0.0
2016-01-31 1.0
2016-02-02 0.0
2016-02-04 0.0
2016-02-06 0.0
2016-02-08 0.0
Freq: 2D, Name: A, dtype: float64
Upvotes: 0
Reputation: 25555
Well, I don't get why the code above isn't working, and I'm going to wait for somebody to give a better answer than this, but I just found that
.replace(np.nan, 0)
does what I would have expected from .fillna(0)
.
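A toy check of the idea; note that in newer pandas versions, calling replace on the resampler itself may no longer dispatch, so here it is applied after an explicit aggregation:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=pd.to_datetime(["2021-01-01", "2021-01-03"]))

# the empty daily bin becomes NaN under mean(), and replace turns it into 0
out = s.resample("1D").mean().replace(np.nan, 0)
print(out)
```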
Upvotes: 12