Zmann3000
Zmann3000

Reputation: 816

How to reset cumulative sum every time there is a NaN in a pandas dataframe?

If I have a Pandas data frame like this:

     1   2   3   4   5   6   7
 1  NaN  1   1   1  NaN  1   1
 2  NaN NaN  1   1   1   1   1 
 3  NaN NaN NaN  1  NaN  1   1
 4   1   1  NaN NaN  1   1  NaN

How do I do a cumulative sum such that the count resets every time there is a NaN value in the row? Such that I get something like this:

     1   2   3   4   5   6   7
 1  NaN  1   2   3  NaN  1   2
 2  NaN NaN  1   2   3   4   5 
 3  NaN NaN NaN  1  NaN  1   2
 4   1   2  NaN NaN  1   2  NaN

Upvotes: 3

Views: 1478

Answers (4)

pettinato
pettinato

Reputation: 1542

I came up with a slightly different answer here that might be helpful.

For as single series I made this function to to do the cumsum-reset on nulls.

def cumsum_reset_on_null(srs: pd.Series) -> pd.Series:
    """
    For a pandas series with null values,
    do a cumsum and reset the cumulative sum when a null value is encountered.
    Example)
      input:  [1, 1, np.nan, 1, 2, 3, np.nan, 1, np.nan]
      return: [1, 2, 0,      1, 3, 6, 0,      1,      0]
    """
    cumulative = srs.cumsum().fillna(method='ffill')
    restart = ((cumulative * srs.isnull()).replace(0.0, np.nan)
               .fillna(method='ffill').fillna(0))
    result = (cumulative - restart)
    return result.replace(0, np.nan)

Then for the full dataframe, just apply this function row-wise

df = pd.DataFrame([
    [np.nan, 1,      1,      1,      np.nan, 1, 1],
    [np.nan, np.nan, 1,      1,      1,      1, 1],
    [np.nan, np.nan, np.nan, 1,      np.nan, 1, 1],
    [1,      1,      np.nan, np.nan, 1,      1, np.nan],
])

df.apply(cumsum_reset_on_null, axis=1)
0   NaN     1.0     2.0     3.0     NaN     1.0     2.0
1   NaN     NaN     1.0     2.0     3.0     4.0     5.0
2   NaN     NaN     NaN     1.0     NaN     1.0     2.0
3   1.0     2.0     NaN     NaN     1.0     2.0     NaN

Upvotes: 0

BENY
BENY

Reputation: 323226

You can do with stack and unstack

s=df.stack(dropna=False).isnull().cumsum()
df=df.where(df.isnull(),s.groupby(s).cumcount().unstack())
df
Out[86]: 
     1    2    3    4    5  6    7
1  NaN  1.0  2.0  3.0  NaN  1  2.0
2  NaN  NaN  1.0  2.0  3.0  4  5.0
3  NaN  NaN  NaN  1.0  NaN  1  2.0
4  3.0  4.0  NaN  NaN  1.0  2  NaN

Upvotes: 0

Kumar Gaurav
Kumar Gaurav

Reputation: 176

One of the way can be:

sample = pd.DataFrame({1:[np.nan,np.nan,np.nan,1],2:[1,np.nan,np.nan,1],3:[1,1,np.nan,np.nan],4:[1,1,1,np.nan],5:[np.nan,1,np.nan,1],6:[1,1,1,1],7:[1,1,1,np.nan]},index=[1,2,3,4])

Output of sample

        1        2       3       4       5      6    7
    1   NaN     1.0     1.0     1.0     NaN     1   1.0
    2   NaN     NaN     1.0     1.0     1.0     1   1.0
    3   NaN     NaN     NaN     1.0     NaN     1   1.0
    4   1.0     1.0     NaN     NaN     1.0     1   NaN

Following code would do:

    #numr = number of rows
    #numc = number of columns
    numr,numc = sample.shape
    for i in range(numr):
        s=0
        flag=0
        for j in range(numc):
            if np.isnan(sample.iloc[i,j]):
                flag=1
            else:
                if flag==1:
                    s=sample.iloc[i,j]
                    flag=0
                else:
                    s+=sample.iloc[i,j]
                    sample.iloc[i,j]=s

Output:

         1       2       3       4       5       6       7 
    1   NaN     1.0     2.0     3.0     NaN     1.0     2.0
    2   NaN     NaN     1.0     2.0     3.0     4.0     5.0
    3   NaN     NaN     NaN     1.0     NaN     1.0     2.0
    4   1.0     2.0     NaN     NaN     1.0     2.0     NaN

Upvotes: -1

Dani Mesejo
Dani Mesejo

Reputation: 61910

You could do:

# compute mask where np.nan = True
mask = pd.isna(df).astype(bool)

# compute cumsum across rows fillna with ffill
cumulative = df.cumsum(1).fillna(method='ffill', axis=1).fillna(0)

# get the values of cumulative where nan is True use the same method
restart = cumulative[mask].fillna(method='ffill', axis=1).fillna(0)

# set the result
result = (cumulative - restart)
result[mask] = np.nan

# display the result
print(result)

Output

     1    2    3    4    5    6    7
0  NaN  1.0  2.0  3.0  NaN  1.0  2.0
1  NaN  NaN  1.0  2.0  3.0  4.0  5.0
2  NaN  NaN  NaN  1.0  NaN  1.0  2.0
3  1.0  2.0  NaN  NaN  1.0  2.0  NaN

Upvotes: 6

Related Questions