Reputation: 4571
I'm working with data that has null values in between. I intend to construct a time-series plot from the cumulative sum of a particular column, sales
. Conditions for the cumulative sum on sales
: (1.) if the first row is null, fillna(0)
, then cumsum()
, so the plot can always start from the origin. (2.) if a run of null rows continues to the end, leave them as null; otherwise fillna(0)
:
import pandas as pd

data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
        'quantity': [10, 21, 20, 10, 39, 30, 31, 45, 23, 56],
        'sales': [None, 41, None, None, 32, 0, 31, None, None, None]}
df = pd.DataFrame(data)
df = df.set_index('year')
df['cum_sales'] = df['sales'].cumsum()
print(df)
df.plot()
How do I apply the conditions so that the result becomes:
Upvotes: 1
Views: 4846
Reputation: 210982
UPDATE: Could you please give an example of how to implement your solution as a function?
def set_col_last_valid(df, col, new_col):
    # zero-fill NaNs up to the last valid value, then take the cumulative sum
    df.loc[(df.index <= df[col].last_valid_index()) & (pd.isnull(df[col])), col] = 0
    df[new_col] = df[col].cumsum()
    return df
In [174]: new = set_col_last_valid(df, 'sales', 'cum_sales')
In [175]: new
Out[175]:
quantity sales cum_sales
year
2010 10 0.0 0.0
2011 21 41.0 41.0
2012 20 0.0 41.0
2013 10 0.0 41.0
2014 39 32.0 73.0
2015 30 0.0 73.0
2016 31 31.0 104.0
2017 45 NaN NaN
2018 23 NaN NaN
2019 56 NaN NaN
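One caveat (my observation, not part of the answer): the function above mutates the caller's `df` in place as well as returning it. If that matters, a non-mutating variant could copy first; a minimal sketch with toy data:

```python
import pandas as pd

def set_col_last_valid(df, col, new_col):
    # Work on a copy so the caller's frame is left untouched.
    df = df.copy()
    df.loc[(df.index <= df[col].last_valid_index()) & df[col].isna(), col] = 0
    df[new_col] = df[col].cumsum()
    return df

df = pd.DataFrame({'sales': [None, 1, None, 2, None]})
out = set_col_last_valid(df, 'sales', 'cum_sales')
```

After the call, `df['sales']` still contains its original NaNs, while `out` carries the zero-filled column and its cumulative sum.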
Original answer:
You can do it this way. First, let's set to 0
all NaNs except those at the end:
df.loc[(df.index <= df.sales.last_valid_index()) & (pd.isnull(df.sales)), 'sales'] = 0
Now you can simply use cumsum():
In [142]: df.sales.cumsum()
Out[142]:
year
2010 0.0
2011 41.0
2012 41.0
2013 41.0
2014 73.0
2015 73.0
2016 104.0
2017 NaN
2018 NaN
2019 NaN
Name: sales, dtype: float64
Explanation:
In [154]: df.sales.last_valid_index()
Out[154]: 2016
In [155]: df.loc[df.index <= df.sales.last_valid_index()]
Out[155]:
quantity sales
year
2010 10 0.0
2011 21 41.0
2012 20 0.0
2013 10 0.0
2014 39 32.0
2015 30 0.0
2016 31 31.0
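As a minimal illustration on toy data (my own example, not from the question): last_valid_index() returns the label of the last non-NaN entry, so everything up to and including that label can safely be zero-filled while the trailing NaNs are left alone:

```python
import pandas as pd

s = pd.Series([None, 10, None, 20, None, None], index=list("abcdef"))

# The last non-null entry is at label 'd'.
print(s.last_valid_index())  # d

# Zero-fill NaNs up to and including that label; trailing NaNs survive.
s.loc[(s.index <= s.last_valid_index()) & s.isna()] = 0
print(s.tolist())  # [0.0, 10.0, 0.0, 20.0, nan, nan]
```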
Upvotes: 1
Reputation: 294546
We're going to use a combination of bfill
and combine_first
. Backfilling the sales column fills in every missing value that is followed by non-missing data, which is exactly the set of positions described by condition (2). However, I use the backfill only to identify those positions: I multiply the result by zero and use that to fill in the missing values, because I don't actually want the backfilled data itself. combine_first
takes the value from the first Series, and where that is missing it falls back to the second. Therefore the missing values that satisfy condition (2) get filled with zero, and the trailing ones remain missing.
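To make the intermediate steps concrete, here is a small sketch (variable names are mine) of what `bfill() * 0` produces on the question's sales list and how `combine_first` merges it back:

```python
import pandas as pd

s = pd.Series([None, 41, None, None, 32, 0, 31, None, None, None])

# Backfill: every NaN followed by real data takes the next value;
# the trailing NaNs stay NaN because nothing follows them.
backfilled = s.bfill()

# Multiplying by zero turns the backfilled values into 0.0
# while NaN * 0 stays NaN, preserving the trailing gaps.
zeros = backfilled * 0

# combine_first keeps s where it has data and takes 0.0 elsewhere.
filled = s.combine_first(zeros)
print(filled.tolist())
# [0.0, 41.0, 0.0, 0.0, 32.0, 0.0, 31.0, nan, nan, nan]
```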
A lot of talk for:
df['sales'] = df['sales'].combine_first(df['sales'].bfill() * 0)
df['cum_sales'] = df['sales'].cumsum()
print(df)
quantity sales cum_sales
year
2010 10 0.0 0.0
2011 21 41.0 41.0
2012 20 0.0 41.0
2013 10 0.0 41.0
2014 39 32.0 73.0
2015 30 0.0 73.0
2016 31 31.0 104.0
2017 45 NaN NaN
2018 23 NaN NaN
2019 56 NaN NaN
The plot:
df.plot()
Upvotes: 4
Reputation: 1367
So I figured it'd be best to change the data in the dictionary before you build the DataFrame. One way to do this is to traverse the list in reverse order, ignoring all None
values until the first numerical value is met. After that point, every None
value should be changed to 0
.
This is not the prettiest way to deal with the problem, but it's written in a way that is easy to read and understand.
flag = False
sales = data["sales"]
# Iterate by index in reverse: rebinding a loop variable
# would not change the list, so we assign back through sales[i].
for i in range(len(sales) - 1, -1, -1):
    if not flag:
        if sales[i] is not None:
            flag = True  # the trailing run of Nones has ended
    elif sales[i] is None:
        sales[i] = 0  # interior None becomes 0
Upvotes: 1