Reputation: 300
I am working with a rather large dataset. After applying resample in combination with the "prod" (multiplication) aggregation, I realized that my NaN values were changed to 1, which is not what I intended. Here is an example of what happened:
# build random dataframe with one column containing NaN
import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=7, freq='d')
df = pd.DataFrame(index = index, columns = ["Score 1", "Score 2", "Score 3"])
df["Score 1"] = np.random.randint(1,20,size=7)
df["Score 2"] = np.random.randint(1,20,size=7)
df["Score 3"] = [1, 2, 3, np.nan, np.nan, np.nan, np.nan]
print(df)
Score 1 Score 2 Score 3
2000-01-01 6 7 1.0
2000-01-02 2 15 2.0
2000-01-03 8 19 3.0
2000-01-04 14 19 NaN
2000-01-05 17 8 NaN
2000-01-06 15 6 NaN
2000-01-07 12 18 NaN
Now let's say I want to resample my DataFrame from a daily to a 3-day frequency using the "prod" aggregation. I do so with:
print(df.resample("3d").agg("prod"))
Score 1 Score 2 Score 3
2000-01-01 96 1995 6.0
2000-01-04 3570 2052 1.0
2000-01-07 12 18 1.0
Looking at the column "Score 3", my NaN values suddenly changed to 1, which surprised me. It looks as if multiplying NaN values with each other yields 1. Does anyone know why exactly a product of NaNs equals one, and what I could do to keep the NaN value when it is multiplied with itself?
Thanks in advance, any help is highly appreciated.
Upvotes: 2
Views: 323
Reputation: 25564
The pandas.DataFrame.prod function (docs) skips NaN values by default (skipna=True), so a product over nothing but NaN values is an empty product and returns 1:
pd.Series([np.nan, np.nan]).prod()
# 1.0
You can circumvent this by setting the corresponding keyword:
pd.Series([np.nan, np.nan]).prod(skipna=False)
# nan
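To make the origin of the 1.0 more concrete, here is a minimal sketch, assuming the default skipna=True simply drops the missing values before multiplying. What remains is an empty product, which is 1 by convention. The min_count keyword of prod is another way to ask for NaN when too few valid values are present; the variable names below are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# With the default skipna=True the NaNs are dropped first, which leaves
# an empty product. An empty product is defined as 1.
empty_product = pd.Series([], dtype="float64").prod()
default_product = s.prod()

# min_count=1 requires at least one non-NaN value; otherwise the result is NaN.
guarded_product = s.prod(min_count=1)

print(empty_product, default_product, guarded_product)
```

Unlike skipna=False, min_count=1 only returns NaN when every value is missing, so a single NaN among valid values does not wipe out the result.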
In your case, you could apply that as
print(df)
Score 1 Score 2 Score 3
2000-01-01 18 19 1.0
2000-01-02 9 18 2.0
2000-01-03 10 4 3.0
2000-01-04 4 15 4.0
2000-01-05 12 1 NaN
2000-01-06 1 3 NaN
2000-01-07 8 9 NaN
print(df.resample("3d").agg(pd.DataFrame.prod, skipna=False))
Score 1 Score 2 Score 3
2000-01-01 1620 1368 6.0
2000-01-04 48 45 NaN
2000-01-07 8 9 NaN
Note that this will set a resampled time window to NaN if the window contains at least one NaN value; I changed the example df slightly to show that. You can apply a lambda instead, checking if at least one element is not NaN:
print(df.resample("3d").apply(lambda x: x.prod() if any(x.notnull()) else np.nan))
Score 1 Score 2 Score 3
2000-01-01 1620 1368 6.0
2000-01-04 48 45 4.0
2000-01-07 8 9 NaN
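As a possible alternative to the lambda, the resampler's own prod method accepts min_count, which should give the same result without a Python-level function call per window. The sketch below rebuilds the modified example frame with fixed values (taken from the printout above) so it can be run on its own; it writes the frequency as "3D", which is an assumption about what current pandas versions prefer over the lowercase alias.

```python
import numpy as np
import pandas as pd

# Rebuild the modified example frame with fixed values so the result is reproducible.
index = pd.date_range("2000-01-01", periods=7, freq="D")
df = pd.DataFrame(
    {
        "Score 1": [18, 9, 10, 4, 12, 1, 8],
        "Score 2": [19, 18, 4, 15, 1, 3, 9],
        "Score 3": [1, 2, 3, 4, np.nan, np.nan, np.nan],
    },
    index=index,
)

# min_count=1: a window needs at least one non-NaN value,
# otherwise its product is NaN instead of 1.
result = df.resample("3D").prod(min_count=1)
print(result)
```

The window starting 2000-01-04 keeps its single valid value in "Score 3", while the window starting 2000-01-07, which holds only NaN, stays NaN.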
Upvotes: 1