Stefan Falk

Reputation: 25555

How to fillna() with value 0 after calling resample?

Either I don't understand the documentation or it is outdated.

If I run

user[["DOC_ACC_DT", "USER_SIGNON_ID"]].groupby("DOC_ACC_DT").agg(["count"]).resample("1D").fillna(value=0, method="ffill")

I get

TypeError: fillna() got an unexpected keyword argument 'value'

If I just run

.fillna(0)

I get

ValueError: Invalid fill method. Expecting pad (ffill), backfill (bfill) or nearest. Got 0

If I then set

.fillna(0, method="ffill") 

I get

TypeError: fillna() got multiple values for keyword argument 'method'

so the only thing that works is

.fillna("ffill")

but of course that just does a forward fill. However, I want to replace NaN with zeros. What am I doing wrong here?

Upvotes: 17

Views: 23041

Answers (5)

dbokers

Reputation: 920

You can simply use sum() for it. Check out the docs at https://pandas.pydata.org/docs/reference/api/pandas.core.resample.Resampler.sum.html

Basically there's a min_count parameter, which defaults to 0: a resampled bin becomes NaN only when it contains fewer than min_count valid (non-NaN) values. With the default of 0, empty bins simply sum to 0, so there is no need for replace or fillna.

In fact, if you wanted to fill with a value other than 0, you could use .sum(min_count=1) followed by .replace(float('nan'), whatever_you_want).

Example below:

>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-01-07', freq='3D'),
...                    'value': range(3)})
>>> df
        date  value
0 2021-01-01      0
1 2021-01-04      1
2 2021-01-07      2

>>> df.set_index('date').resample('D').sum().reset_index()
        date  value
0 2021-01-01      0
1 2021-01-02      0
2 2021-01-03      0
3 2021-01-04      1
4 2021-01-05      0
5 2021-01-06      0
6 2021-01-07      2

# to replace NaN with some other value, you could also use replace()
# (handy if there is more than one column to fill)
>>> df.set_index('date').resample('D').sum(min_count=1).fillna(-10).reset_index()
        date  value
0 2021-01-01    0.0
1 2021-01-02  -10.0
2 2021-01-03  -10.0
3 2021-01-04    1.0
4 2021-01-05  -10.0
5 2021-01-06  -10.0
6 2021-01-07    2.0

Upvotes: 3

Ryszard Cetnarski

Reputation: 2052

The issue here is that you are trying to call fillna on the DatetimeIndexResampler object returned by resample. If you call an aggregation function before fillna, it works, for example: df.resample('1H').sum().fillna(0)
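
A minimal sketch of this fix applied to the question's chain (the frame below is made-up data using the question's column names):

import pandas as pd

# Made-up data with the question's column names.
user = pd.DataFrame({
    "DOC_ACC_DT": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-04"]),
    "USER_SIGNON_ID": ["u1", "u2", "u3"],
})

# Aggregate first (count per day), then resample and aggregate again;
# fillna() is then called on a plain Series rather than on the Resampler,
# so value-based filling works.
daily = (user.groupby("DOC_ACC_DT")["USER_SIGNON_ID"]
             .count()
             .resample("1D")
             .first()        # any aggregation works; empty days become NaN
             .fillna(0))
print(daily)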

Upvotes: 2

jezrael

Reputation: 863721

I did some tests and the results are quite interesting.

Sample:

import pandas as pd
import numpy as np

np.random.seed(1)
rng = pd.date_range('1/1/2012', periods=20, freq='S')
df = pd.DataFrame({'a':['a'] * 10 + ['b'] * 10,
                   'b':np.random.randint(0, 500, len(rng))}, index=rng)
# cast to float first, then set NaN without chained assignment
df['b'] = df['b'].astype(float)
df.loc[df.index[3:8], 'b'] = np.nan
print (df)
                     a      b
2012-01-01 00:00:00  a   37.0
2012-01-01 00:00:01  a  235.0
2012-01-01 00:00:02  a  396.0
2012-01-01 00:00:03  a    NaN
2012-01-01 00:00:04  a    NaN
2012-01-01 00:00:05  a    NaN
2012-01-01 00:00:06  a    NaN
2012-01-01 00:00:07  a    NaN
2012-01-01 00:00:08  a  335.0
2012-01-01 00:00:09  a  448.0
2012-01-01 00:00:10  b  144.0
2012-01-01 00:00:11  b  129.0
2012-01-01 00:00:12  b  460.0
2012-01-01 00:00:13  b   71.0
2012-01-01 00:00:14  b  237.0
2012-01-01 00:00:15  b  390.0
2012-01-01 00:00:16  b  281.0
2012-01-01 00:00:17  b  178.0
2012-01-01 00:00:18  b  276.0
2012-01-01 00:00:19  b  254.0

Downsampling:

Possible solution with Resampler.asfreq:

If you use asfreq, the behaviour is the same as aggregating with first:

print (df.groupby('a').resample('2S').first())
                       a      b
a                              
a 2012-01-01 00:00:00  a   37.0
  2012-01-01 00:00:02  a  396.0
  2012-01-01 00:00:04  a    NaN
  2012-01-01 00:00:06  a    NaN
  2012-01-01 00:00:08  a  335.0
b 2012-01-01 00:00:10  b  144.0
  2012-01-01 00:00:12  b  460.0
  2012-01-01 00:00:14  b  237.0
  2012-01-01 00:00:16  b  281.0
  2012-01-01 00:00:18  b  276.0
print (df.groupby('a').resample('2S').first().fillna(0))
                       a      b
a                              
a 2012-01-01 00:00:00  a   37.0
  2012-01-01 00:00:02  a  396.0
  2012-01-01 00:00:04  a    0.0
  2012-01-01 00:00:06  a    0.0
  2012-01-01 00:00:08  a  335.0
b 2012-01-01 00:00:10  b  144.0
  2012-01-01 00:00:12  b  460.0
  2012-01-01 00:00:14  b  237.0
  2012-01-01 00:00:16  b  281.0
  2012-01-01 00:00:18  b  276.0

print (df.groupby('a').resample('2S').asfreq().fillna(0))
                       a      b
a                              
a 2012-01-01 00:00:00  a   37.0
  2012-01-01 00:00:02  a  396.0
  2012-01-01 00:00:04  a    0.0
  2012-01-01 00:00:06  a    0.0
  2012-01-01 00:00:08  a  335.0
b 2012-01-01 00:00:10  b  144.0
  2012-01-01 00:00:12  b  460.0
  2012-01-01 00:00:14  b  237.0
  2012-01-01 00:00:16  b  281.0
  2012-01-01 00:00:18  b  276.0

If you use replace, the other values are aggregated as with mean:

print (df.groupby('a').resample('2S').mean())
                           b
a                           
a 2012-01-01 00:00:00  136.0
  2012-01-01 00:00:02  396.0
  2012-01-01 00:00:04    NaN
  2012-01-01 00:00:06    NaN
  2012-01-01 00:00:08  391.5
b 2012-01-01 00:00:10  136.5
  2012-01-01 00:00:12  265.5
  2012-01-01 00:00:14  313.5
  2012-01-01 00:00:16  229.5
  2012-01-01 00:00:18  265.0
print (df.groupby('a').resample('2S').mean().fillna(0))
                           b
a                           
a 2012-01-01 00:00:00  136.0
  2012-01-01 00:00:02  396.0
  2012-01-01 00:00:04    0.0
  2012-01-01 00:00:06    0.0
  2012-01-01 00:00:08  391.5
b 2012-01-01 00:00:10  136.5
  2012-01-01 00:00:12  265.5
  2012-01-01 00:00:14  313.5
  2012-01-01 00:00:16  229.5
  2012-01-01 00:00:18  265.0

print (df.groupby('a').resample('2S').replace(np.nan,0))
                           b
a                           
a 2012-01-01 00:00:00  136.0
  2012-01-01 00:00:02  396.0
  2012-01-01 00:00:04    0.0
  2012-01-01 00:00:06    0.0
  2012-01-01 00:00:08  391.5
b 2012-01-01 00:00:10  136.5
  2012-01-01 00:00:12  265.5
  2012-01-01 00:00:14  313.5
  2012-01-01 00:00:16  229.5
  2012-01-01 00:00:18  265.0

Upsampling:

Use asfreq; it gives the same result as replace:

print (df.groupby('a').resample('200L').asfreq().fillna(0))
                           a      b
a                                  
a 2012-01-01 00:00:00.000  a   37.0
  2012-01-01 00:00:00.200  0    0.0
  2012-01-01 00:00:00.400  0    0.0
  2012-01-01 00:00:00.600  0    0.0
  2012-01-01 00:00:00.800  0    0.0
  2012-01-01 00:00:01.000  a  235.0
  2012-01-01 00:00:01.200  0    0.0
  2012-01-01 00:00:01.400  0    0.0
  2012-01-01 00:00:01.600  0    0.0
  2012-01-01 00:00:01.800  0    0.0
  2012-01-01 00:00:02.000  a  396.0
  2012-01-01 00:00:02.200  0    0.0
  2012-01-01 00:00:02.400  0    0.0
  ...

print (df.groupby('a').resample('200L').replace(np.nan,0))
                               b
a                               
a 2012-01-01 00:00:00.000   37.0
  2012-01-01 00:00:00.200    0.0
  2012-01-01 00:00:00.400    0.0
  2012-01-01 00:00:00.600    0.0
  2012-01-01 00:00:00.800    0.0
  2012-01-01 00:00:01.000  235.0
  2012-01-01 00:00:01.200    0.0
  2012-01-01 00:00:01.400    0.0
  2012-01-01 00:00:01.600    0.0
  2012-01-01 00:00:01.800    0.0
  2012-01-01 00:00:02.000  396.0
  2012-01-01 00:00:02.200    0.0
  2012-01-01 00:00:02.400    0.0
  ...
print ((df.groupby('a').resample('200L').replace(np.nan,0).b == 
       df.groupby('a').resample('200L').asfreq().fillna(0).b).all())
True

Conclusion:

For downsampling, use an aggregating function such as sum, first or mean; for upsampling, use asfreq.
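
A compact sketch of that conclusion on a tiny made-up series:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0],
              index=pd.date_range('2012-01-01', periods=5, freq='S'))

# Downsampling: aggregate first, then fill the all-NaN bins.
print (s.resample('2S').mean().fillna(0))

# Upsampling: asfreq() exposes the new slots as NaN, then fill.
print (s.resample('500L').asfreq().fillna(0))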

Upvotes: 8

Nickil Maveli

Reputation: 29719

The only workaround close to using fillna directly would be to call it after performing .head(len(df.index)).

I presume DF.head is useful here mainly because when the resample function is applied to a groupby object, it acts as a filter on the input, returning a reduced shape of the original due to the elimination of groups.

Calling DF.head() is not affected by this transformation and returns the entire DF.

Demo:

import numpy as np
import pandas as pd

np.random.seed(42)

df = pd.DataFrame(np.random.randn(10, 2),
                  index=pd.date_range('1/1/2016', freq='10D', periods=10),
                  columns=['A', 'B']).reset_index()

df
       index         A         B
0 2016-01-01  0.496714 -0.138264
1 2016-01-11  0.647689  1.523030
2 2016-01-21 -0.234153 -0.234137
3 2016-01-31  1.579213  0.767435
4 2016-02-10 -0.469474  0.542560
5 2016-02-20 -0.463418 -0.465730
6 2016-03-01  0.241962 -1.913280
7 2016-03-11 -1.724918 -0.562288
8 2016-03-21 -1.012831  0.314247
9 2016-03-31 -0.908024 -1.412304

Operations:

resampled_group = df[['index', 'A']].groupby(['index'])['A'].agg('count').resample('2D')
resampled_group.head(len(resampled_group.index)).fillna(0).head(20)

index
2016-01-01    1.0
2016-01-03    0.0
2016-01-05    0.0
2016-01-07    0.0
2016-01-09    0.0
2016-01-11    1.0
2016-01-13    0.0
2016-01-15    0.0
2016-01-17    0.0
2016-01-19    0.0
2016-01-21    1.0
2016-01-23    0.0
2016-01-25    0.0
2016-01-27    0.0
2016-01-29    0.0
2016-01-31    1.0
2016-02-02    0.0
2016-02-04    0.0
2016-02-06    0.0
2016-02-08    0.0
Freq: 2D, Name: A, dtype: float64

Upvotes: 0

Stefan Falk

Reputation: 25555

Well, I don't understand why the code above is not working, and I'm going to wait for somebody to give a better answer than this, but I just found that

.replace(np.nan, 0)

does what I would have expected from .fillna(0).
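
A minimal sketch of the same idea on a toy series (whether replace can be chained directly onto the Resampler depends on the pandas version, so here it is called on the asfreq() result):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=pd.to_datetime(["2016-01-01", "2016-01-03"]))

r = s.resample("1D").asfreq()     # the 2016-01-02 slot becomes NaN
print(r.replace(np.nan, 0))       # here this gives the same result as r.fillna(0)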

Upvotes: 12
