Reputation: 816
If I have a Pandas data frame like this:
1 2 3 4 5 6 7
1 NaN 1 1 1 NaN 1 1
2 NaN NaN 1 1 1 1 1
3 NaN NaN NaN 1 NaN 1 1
4 1 1 NaN NaN 1 1 NaN
How do I do a cumulative sum such that the count resets every time there is a NaN value in the row, so that I get something like this:
1 2 3 4 5 6 7
1 NaN 1 2 3 NaN 1 2
2 NaN NaN 1 2 3 4 5
3 NaN NaN NaN 1 NaN 1 2
4 1 2 NaN NaN 1 2 NaN
Upvotes: 3
Views: 1478
Reputation: 1542
I came up with a slightly different answer here that might be helpful.
For a single series, I made this function to do the cumsum-reset on nulls:
def cumsum_reset_on_null(srs: pd.Series) -> pd.Series:
    """
    For a pandas series with null values,
    do a cumsum and reset the cumulative sum when a null value is encountered.
    Example)
    input:  [1, 1, np.nan, 1, 2, 3, np.nan, 1, np.nan]
    return: [1, 2, NaN, 1, 3, 6, NaN, 1, NaN]
    """
    cumulative = srs.cumsum().ffill()
    restart = ((cumulative * srs.isnull()).replace(0.0, np.nan)
               .ffill().fillna(0))
    result = cumulative - restart
    # put the NaNs back where the input was null
    return result.mask(srs.isnull())
Then, for the full dataframe, just apply this function row-wise:
df = pd.DataFrame([
    [np.nan, 1, 1, 1, np.nan, 1, 1],
    [np.nan, np.nan, 1, 1, 1, 1, 1],
    [np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
    [1, 1, np.nan, np.nan, 1, 1, np.nan],
])
df.apply(cumsum_reset_on_null, axis=1)
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
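The same reset logic can also be written more compactly with a single groupby on the running null count; a minimal sketch (cumsum_reset is an illustrative name, not from the original answer):

```python
import numpy as np
import pandas as pd

def cumsum_reset(srs: pd.Series) -> pd.Series:
    # every null starts a new group, so cumsum restarts inside each group
    groups = srs.isnull().cumsum()
    summed = srs.fillna(0).groupby(groups).cumsum()
    # put the NaNs back where the input was null
    return summed.mask(srs.isnull())

s = pd.Series([1, 1, np.nan, 1, 2, 3, np.nan, 1, np.nan])
print(cumsum_reset(s).tolist())
# [1.0, 2.0, nan, 1.0, 3.0, 6.0, nan, 1.0, nan]
```

Applied row-wise with df.apply(cumsum_reset, axis=1), it produces the same frame as shown above.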
Upvotes: 0
Reputation: 323226
You can do this with stack and unstack. The trick is to group by both the row label and a running count of NaNs, so the count restarts at every NaN and at every new row (counting the notnull flags equals the cumulative sum here because every value is 1):
s = df.stack(dropna=False)
key = s.isnull().groupby(level=0).cumsum()
df = df.where(df.isnull(),
              s.notnull().groupby([s.index.get_level_values(0), key]).cumsum().unstack())
df
Out[86]:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1 2.0
2 NaN NaN 1.0 2.0 3.0 4 5.0
3 NaN NaN NaN 1.0 NaN 1 2.0
4 1.0 2.0 NaN NaN 1.0 2 NaN
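A self-contained run of the stack/unstack idea, built from the question's frame. The key point is that the NaN count must restart on every row (hence the groupby on the row label as well); otherwise a row that begins with values silently continues the previous row's count. All names below are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 1, 1, 1, np.nan, 1, 1],
    [np.nan, np.nan, 1, 1, 1, 1, 1],
    [np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
    [1, 1, np.nan, np.nan, 1, 1, np.nan],
], index=[1, 2, 3, 4], columns=range(1, 8))

try:
    s = df.stack(dropna=False)   # older pandas: keep the NaNs explicitly
except TypeError:
    s = df.stack()               # newer pandas keeps NaNs by default

key = s.isnull().groupby(level=0).cumsum()   # per-row running NaN count
# within each (row, key) stretch between NaNs, count the non-null entries
counts = s.notnull().groupby([s.index.get_level_values(0), key]).cumsum()
result = df.where(df.isnull(), counts.unstack())
print(result)
```

Note that cumsum over the notnull flags counts positions, which matches the cumulative sum only because every value in the question's frame is 1.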
Upvotes: 0
Reputation: 176
One way can be:
sample = pd.DataFrame(
    {1: [np.nan, np.nan, np.nan, 1],
     2: [1, np.nan, np.nan, 1],
     3: [1, 1, np.nan, np.nan],
     4: [1, 1, 1, np.nan],
     5: [np.nan, 1, np.nan, 1],
     6: [1, 1, 1, 1],
     7: [1, 1, 1, np.nan]},
    index=[1, 2, 3, 4])
Output of sample
1 2 3 4 5 6 7
1 NaN 1.0 1.0 1.0 NaN 1 1.0
2 NaN NaN 1.0 1.0 1.0 1 1.0
3 NaN NaN NaN 1.0 NaN 1 1.0
4 1.0 1.0 NaN NaN 1.0 1 NaN
The following code would do it:
# numr = number of rows
# numc = number of columns
numr, numc = sample.shape
for i in range(numr):
    s = 0
    flag = 0
    for j in range(numc):
        if np.isnan(sample.iloc[i, j]):
            flag = 1
        else:
            if flag == 1:
                s = sample.iloc[i, j]
                flag = 0
            else:
                s += sample.iloc[i, j]
            sample.iloc[i, j] = s
Output:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1.0 2.0
2 NaN NaN 1.0 2.0 3.0 4.0 5.0
3 NaN NaN NaN 1.0 NaN 1.0 2.0
4 1.0 2.0 NaN NaN 1.0 2.0 NaN
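Scalar iloc access inside a double loop gets slow on larger frames; the same scan over a plain NumPy array is much faster. A sketch of the same logic (reset_cumsum_inplace is an illustrative name), with the flag replaced by simply zeroing the running sum at each NaN:

```python
import numpy as np
import pandas as pd

def reset_cumsum_inplace(df: pd.DataFrame) -> pd.DataFrame:
    vals = df.to_numpy(dtype=float, copy=True)  # copy so df is untouched
    for i in range(vals.shape[0]):
        s = 0.0
        for j in range(vals.shape[1]):
            if np.isnan(vals[i, j]):
                s = 0.0                         # a NaN resets the running sum
            else:
                s += vals[i, j]
                vals[i, j] = s
    return pd.DataFrame(vals, index=df.index, columns=df.columns)
```

NaN cells are never written to in the copy, so they stay NaN in the result.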
Upvotes: -1
Reputation: 61910
You could do:
# boolean mask, True where the value is NaN
mask = pd.isna(df)
# cumulative sum across rows; forward-fill over the NaN gaps
cumulative = df.cumsum(axis=1).ffill(axis=1).fillna(0)
# the cumulative total as of the most recent NaN, forward-filled the same way
restart = cumulative[mask].ffill(axis=1).fillna(0)
# subtracting restarts the count after every NaN
result = cumulative - restart
result[mask] = np.nan
# display the result
print(result)
Output
1 2 3 4 5 6 7
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
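The steps above wrap naturally into one reusable function; a sketch using the non-deprecated DataFrame.ffill in place of fillna(method='ffill') (cumsum_reset_rows is an illustrative name):

```python
import numpy as np
import pandas as pd

def cumsum_reset_rows(df: pd.DataFrame) -> pd.DataFrame:
    mask = df.isna()
    # running row sums, carried forward across the NaN gaps
    cumulative = df.cumsum(axis=1).ffill(axis=1).fillna(0)
    # cumulative total as of the most recent NaN in each row
    restart = cumulative.where(mask).ffill(axis=1).fillna(0)
    # subtract to restart the count, then put the NaNs back
    return (cumulative - restart).mask(mask)

df = pd.DataFrame([
    [np.nan, 1, 1, 1, np.nan, 1, 1],
    [np.nan, np.nan, 1, 1, 1, 1, 1],
    [np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
    [1, 1, np.nan, np.nan, 1, 1, np.nan],
])
print(cumsum_reset_rows(df))
```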
Upvotes: 6