Reputation: 3671
One additional note to address the problem better, in the actual data set there is also a column called store and the table can be grouped by store, date & product, When I tried the pivot solution and the cartesian product solution it did not work, is there a solution that could work for 3 grouping columns? Also the table has millions of rows.
Assuming a data frame with the following format:
d = {'product': ['a', 'b', 'c', 'a', 'b'], 'amount': [1, 2, 3, 5, 2], 'date': ['2020-6-6', '2020-6-6', '2020-6-6',
'2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
Product c is no longer present on the date 2020-6-7, I want to be able to calculate things like percent change or difference in the amount of each product.
For example: df['diff'] = df.groupby('product')['amount'].diff()
But in order for this to work and show for example that the difference of c is -3 and -100%, c would need to be present on the next date with the amount set to 0
This is the results I am looking for:
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
5 c 0 2020-6-7
Please note this is just a snipped data frame, in reality there might be many date periods, I am only looking to fill in the product and amount in the first date after it has been removed, not all dates after.
What is the best way to go about this?
Upvotes: 2
Views: 463
Reputation: 28644
You could use the complete function from pyjanitor to explicitly expose the missing values and combine with fillna to fill the missing values with 0:
# pip install pyjanitor
# import janitor
df.complete(['date', 'product']).fillna(0)
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
Upvotes: 1
Reputation: 23099
another way is to do create a cartesian product of your products & dates, then join that to your main dataframe to get the missing values.
#df['date'] = pd.to_datetime(df['date'])
#ensure you have a proper datetime object.
s = pd.merge( df[['product']].drop_duplicates().assign(ky=-1),
df[['date']].drop_duplicates().assign(ky=-1),
on=['ky']
).drop('ky',1)
df1 = pd.merge(df,s,
on = ['product','date']
,how='outer'
).fillna(0)
print(df1)
product amount date
0 a 1.0 2020-06-06
1 b 2.0 2020-06-06
2 c 3.0 2020-06-06
3 a 5.0 2020-06-07
4 b 2.0 2020-06-07
5 c 0.0 2020-06-07
Upvotes: 0
Reputation: 323226
Let us try pivot
then unstack
out = df.pivot('product','date','amount').fillna(0).unstack().reset_index(name='amount')
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
Upvotes: 3