Mustard Tiger

Reputation: 3671

Pandas fill row values using previous period

One additional note to clarify the problem: the actual data set also has a column called store, so the table is grouped by store, date and product. When I tried the pivot solution and the cartesian product solution, they did not work. Is there a solution that works for three grouping columns? The table also has millions of rows.

Assuming a data frame with the following format:

import pandas as pd

d = {'product': ['a', 'b', 'c', 'a', 'b'], 'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)

print(df)
 product  amount  date
0     a     1      2020-6-6
1     b     2      2020-6-6
2     c     3      2020-6-6
3     a     5      2020-6-7
4     b     2      2020-6-7

Product c is no longer present on the date 2020-6-7. I want to be able to calculate things like the percent change or the difference in the amount of each product.

For example: df['diff'] = df.groupby('product')['amount'].diff()

But for this to work and show, for example, that the difference for c is -3 (-100%), c would need to be present on the next date with the amount set to 0.

This is the result I am looking for:

print(df)
 product  amount  date
0     a     1      2020-6-6
1     b     2      2020-6-6
2     c     3      2020-6-6
3     a     5      2020-6-7
4     b     2      2020-6-7
5     c     0      2020-6-7

Please note this is just a snippet of the data frame; in reality there might be many date periods. I only want to fill in the product and amount on the first date after it has been removed, not on all dates after.

What is the best way to go about this?

Upvotes: 2

Views: 463

Answers (3)

sammywemmy

Reputation: 28644

You could use the complete function from pyjanitor to explicitly expose the missing values and combine with fillna to fill the missing values with 0:

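# pip install pyjanitor
import janitor  # registers the .complete method on DataFrames
# Depending on the pyjanitor version, the columns may need to be passed
# as separate arguments instead: df.complete('date', 'product')
df.complete(['date', 'product']).fillna(0)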

       date product  amount
0  2020-6-6       a     1.0
1  2020-6-6       b     2.0
2  2020-6-6       c     3.0
3  2020-6-7       a     5.0
4  2020-6-7       b     2.0
5  2020-6-7       c     0.0
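
If installing pyjanitor is not an option, the same "expose the missing combinations" idea can be sketched with plain pandas by reindexing onto the full date/product grid. This is only a sketch: it assumes each (date, product) pair appears at most once, and the names full_index and completed are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['a', 'b', 'c', 'a', 'b'],
    'amount': [1, 2, 3, 5, 2],
    'date': ['2020-6-6', '2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7'],
})

# Every (date, product) combination that could exist.
full_index = pd.MultiIndex.from_product(
    [df['date'].unique(), df['product'].unique()],
    names=['date', 'product'],
)

# Reindex onto the full grid; combinations absent from df get amount 0.
# Assumption: (date, product) is unique, otherwise reindex raises.
completed = (
    df.set_index(['date', 'product'])
      .reindex(full_index, fill_value=0)
      .reset_index()
)
print(completed)
```

Because fill_value is given to reindex directly, amount stays an integer column instead of being upcast to float by an intermediate NaN.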

Upvotes: 1

Umar.H

Reputation: 23099

Another way is to create a cartesian product of your products and dates, then join that to your main dataframe to get the missing values.

#df['date'] = pd.to_datetime(df['date'])
#ensure you have a proper datetime object.

s = pd.merge(df[['product']].drop_duplicates().assign(ky=-1),
             df[['date']].drop_duplicates().assign(ky=-1),
             on=['ky']
            ).drop(columns='ky')  # the positional axis argument was removed in pandas 2.0

df1 = pd.merge(df,s,
         on = ['product','date']
         ,how='outer'
).fillna(0)

print(df1)

  product  amount       date
0       a     1.0 2020-06-06
1       b     2.0 2020-06-06
2       c     3.0 2020-06-06
3       a     5.0 2020-06-07
4       b     2.0 2020-06-07
5       c     0.0 2020-06-07
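
For the three-column case mentioned in the question, one possible variation is to cross only the (store, product) pairs that actually occur with the distinct dates, rather than building the full store x product x date product. The sketch below assumes pandas 1.2 or later (for how='cross') and a store column as described in the question; the sample values and the names pairs, dates, grid and filled are invented for illustration. On millions of rows the grid can still get large, so treat this as a starting point only.

```python
import pandas as pd

# Hypothetical sample with the extra 'store' column from the question.
df = pd.DataFrame({
    'store': ['s1', 's1', 's1', 's1', 's1'],
    'product': ['a', 'b', 'c', 'a', 'b'],
    'amount': [1, 2, 3, 5, 2],
    'date': pd.to_datetime(['2020-06-06'] * 3 + ['2020-06-07'] * 2),
})

# Only (store, product) pairs that were actually observed.
pairs = df[['store', 'product']].drop_duplicates()
dates = df[['date']].drop_duplicates()

# Cross join: every observed pair on every date (pandas >= 1.2).
grid = pairs.merge(dates, how='cross')

# Left join the real amounts back; missing combinations become 0.
filled = (
    grid.merge(df, on=['store', 'product', 'date'], how='left')
        .fillna({'amount': 0})
)
print(filled)
```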

Upvotes: 0

BENY

Reputation: 323226

Let us try pivot then unstack

out = df.pivot(index='product', columns='date', values='amount').fillna(0).unstack().reset_index(name='amount')  # pivot arguments are keyword-only since pandas 2.0
       date product  amount
0  2020-6-6       a     1.0
1  2020-6-6       b     2.0
2  2020-6-6       c     3.0
3  2020-6-7       a     5.0
4  2020-6-7       b     2.0
5  2020-6-7       c     0.0
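
Note that the approaches above fill every missing date, while the question asks for a zero row only on the first date after a product disappears, with store as an extra grouping column. A rough sketch of one way to do that without building a full grid: shift each existing row to the next date in the data, then keep only the shifted rows that do not already exist. The function name add_zero_rows_after_removal, the sample data and the store values are hypothetical, and "next date" is assumed to mean the next date present anywhere in the data set.

```python
import pandas as pd

def add_zero_rows_after_removal(df, keys=('store', 'product'),
                                date_col='date', value_col='amount'):
    """Append a zero row on the first date after a key combination vanishes."""
    keys = list(keys)
    on = keys + [date_col]

    # Map each date to the following date present in the data (assumption:
    # the calendar is the set of dates observed across all stores).
    dates = df[[date_col]].drop_duplicates().sort_values(date_col, ignore_index=True)
    dates['next_date'] = dates[date_col].shift(-1)

    # Move every existing row forward by one period; the last date has no successor.
    candidates = df[on].merge(dates, on=date_col).dropna(subset=['next_date'])
    candidates = candidates.drop(columns=date_col).rename(columns={'next_date': date_col})
    candidates[value_col] = 0

    # Anti-join: keep only shifted rows that are absent from the original data.
    merged = candidates.merge(df[on], on=on, how='left', indicator=True)
    missing = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

    return pd.concat([df, missing], ignore_index=True).sort_values(on[::-1], ignore_index=True)

# Hypothetical sample: product c is sold on the first date only.
df = pd.DataFrame({
    'store': ['s1'] * 7,
    'product': ['a', 'b', 'c', 'a', 'b', 'a', 'b'],
    'amount': [1, 2, 3, 5, 2, 4, 1],
    'date': pd.to_datetime(['2020-06-06'] * 3 + ['2020-06-07'] * 2 + ['2020-06-08'] * 2),
})

result = add_zero_rows_after_removal(df)
result['diff'] = result.groupby(['store', 'product'])['amount'].diff()
print(result)
```

With this sample, c gets a single zero row on 2020-06-07 and nothing on 2020-06-08. The work is two merges on the original rows, so memory use should scale with the size of the data rather than with the full cartesian grid, though I have not benchmarked it on millions of rows.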

Upvotes: 3
