DevEx

Reputation: 4571

Pandas conditional cumulative sum

I'm working with data that has null values scattered through it. I intend to construct a time-series plot from the cumulative sum of a particular column, sales. Conditions for the cumulative sum on sales: (1) if the first row is null, fillna(0) before cumsum() so the plot always starts from the origin; (2) if a run of null rows continues to the end, leave them as null, otherwise fillna(0):

data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016,2017, 2018, 2019],
        'quantity': [10, 21, 20, 10, 39, 30, 31,45, 23, 56],
        'sales': [None, 41, None, None, 32, 0, 31,None, None, None]}

df = pd.DataFrame(data)
df = df.set_index('year')

df['cum_sales'] = df[['sales']].cumsum()

print(df)
df.plot()

[plot of df with the naive cumulative sum]

How to apply conditions such that result becomes:

[desired plot: cumulative sum starting at the origin and stopping at the last valid year]

Upvotes: 1

Views: 4846

Answers (3)

MaxU - stand with Ukraine

Reputation: 210982

UPDATE: Could you please give an example of how to implement your solution as a function?

def set_col_last_valid(df, col, new_col):
    df.loc[(df.index <= df[col].last_valid_index()) & (pd.isnull(df[col])), col] = 0
    df[new_col] = df[col].cumsum()
    return df

In [174]: new = set_col_last_valid(df, 'sales', 'cum_sales')

In [175]: new
Out[175]:
      quantity  sales  cum_sales
year
2010        10    0.0        0.0
2011        21   41.0       41.0
2012        20    0.0       41.0
2013        10    0.0       41.0
2014        39   32.0       73.0
2015        30    0.0       73.0
2016        31   31.0      104.0
2017        45    NaN        NaN
2018        23    NaN        NaN
2019        56    NaN        NaN

Original answer:

You can do it this way.

First, let's set all NaNs to 0, except those at the end:

df.loc[(df.index <= df.sales.last_valid_index()) & (pd.isnull(df.sales)), 'sales'] = 0

Now you can simply use cumsum():

In [142]: df.sales.cumsum()
Out[142]:
year
2010      0.0
2011     41.0
2012     41.0
2013     41.0
2014     73.0
2015     97.0
2016    128.0
2017      NaN
2018      NaN
2019      NaN
Name: sales, dtype: float64

Explanation:

In [154]: df.sales.last_valid_index()
Out[154]: 2016

In [155]: df.loc[df.index <= df.sales.last_valid_index()]
Out[155]:
      quantity  sales
year
2010        10    0.0
2011        21   41.0
2012        20    0.0
2013        10    0.0
2014        39   32.0
2015        30   24.0
2016        31   31.0
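Putting the pieces together, here is a minimal runnable sketch of this approach using the question's data as posted (note the session above had sales=24 for 2015, so its totals differ):

```python
import pandas as pd

# The question's data (sales only, for brevity).
df = pd.DataFrame(
    {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
     'sales': [None, 41, None, None, 32, 0, 31, None, None, None]}
).set_index('year')

# Zero out the NaNs at or before the last valid entry,
# leaving the trailing run of NaNs untouched.
last = df.sales.last_valid_index()   # 2016
df.loc[(df.index <= last) & df.sales.isnull(), 'sales'] = 0

df['cum_sales'] = df.sales.cumsum()
```

The cumulative sum then starts at 0 in 2010, reaches 104 in 2016, and stays NaN for 2017 onward.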

Upvotes: 1

piRSquared

Reputation: 294546

We're going to use a combination of bfill and combine_first. When we backfill the sales column, we fill in any missing value that is followed by non-missing data, thus satisfying condition (2). However, I use the backfill only to identify the positions that satisfy condition (2): I multiply the backfilled series by zero and use the result to fill in missing values, because I don't actually want to backfill the data. combine_first takes the value from the first series; if it's missing, it falls back to the second series. Therefore the missing values that don't satisfy condition (2) get filled with zero, and the trailing ones remain missing.

A lot of talk for:

df.sales = df.sales.combine_first(df.sales.bfill() * 0)

df['cum_sales'] = df.sales.cumsum()

print(df)

      quantity  sales  cum_sales
year                            
2010        10    0.0        0.0
2011        21   41.0       41.0
2012        20    0.0       41.0
2013        10    0.0       41.0
2014        39   32.0       73.0
2015        30    0.0       73.0
2016        31   31.0      104.0
2017        45    NaN        NaN
2018        23    NaN        NaN
2019        56    NaN        NaN
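To see the intermediate step, here is a minimal sketch (using just the question's sales values) of what bfill() * 0 contributes:

```python
import pandas as pd

sales = pd.Series([None, 41, None, None, 32, 0, 31, None, None, None],
                  dtype=float)

# Backfill pulls the next valid value upward, so only the trailing
# NaNs (which have nothing after them to pull from) stay NaN.
mask = sales.bfill() * 0          # 0.0 up through position 6, NaN after

# combine_first keeps the original values and fills only their NaNs
# from mask: leading/interior NaNs become 0, trailing NaNs stay NaN.
filled = sales.combine_first(mask)
```

The zero-multiplied backfill acts as a mask of "followed by data" positions, which is exactly the set condition (2) allows us to fill.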

The plot:

df.plot()

[plot of df with the corrected cum_sales]

Upvotes: 4

Tomasz Plaskota

Reputation: 1367

So I figured it'd be best to change the data in the dictionary before you build the DataFrame. A simple way to do this is to traverse the list in reverse order, ignoring None values until the first numerical value is met; after that point, every None should be changed to 0.

This is not the prettiest way to deal with the problem, but it's written in a way that is easy to read and understand.

sales = data["sales"]
flag = False
# iterate by index in reverse; reassigning the loop variable
# would not change the list itself
for i in range(len(sales) - 1, -1, -1):
    if not flag:
        if sales[i] is not None:   # plain truthiness would also skip a real 0
            flag = True
    elif sales[i] is None:
        sales[i] = 0
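End to end, a runnable version of this idea with the question's data looks like the sketch below (the traversal assigns by index so the changes persist in the list before the DataFrame is built):

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
        'sales': [None, 41, None, None, 32, 0, 31, None, None, None]}

sales = data['sales']
flag = False
# Walk backwards; once the first numerical value is seen,
# every remaining None becomes 0.
for i in range(len(sales) - 1, -1, -1):
    if not flag:
        if sales[i] is not None:
            flag = True
    elif sales[i] is None:
        sales[i] = 0

# The trailing Nones survive, so cumsum leaves them as NaN.
df = pd.DataFrame(data).set_index('year')
df['cum_sales'] = df['sales'].cumsum()
```

pandas converts the remaining Nones to NaN when building the column, so cumsum produces the same trailing-NaN result as the other answers.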

Upvotes: 1
