Reputation: 183
I have this DataFrame (this is just an example, not the real data):
In [1]: import pandas as pd
my_data = [{'client_id' : '001', 'items' : '10', 'month' : 'Jan'},
{'client_id' : '001', 'items' : '20', 'month' : 'Feb'},
{'client_id' : '001', 'items' : '30', 'month' : 'Mar'},
{'client_id' : '002', 'items' : '30', 'month' : 'Jan'},
{'client_id' : '002', 'items' : '20', 'month' : 'Feb'},
{'client_id' : '002', 'items' : '15', 'month' : 'Mar'},
{'client_id' : '003', 'items' : '10', 'month' : 'Jan'},
{'client_id' : '003', 'items' : '20', 'month' : 'Feb'},
{'client_id' : '003', 'items' : '15', 'month' : 'Mar'}]
df = pd.DataFrame(my_data)
In [2]: df
Out [2]:
client_id month items
0 001 Jan 10
1 001 Feb 20
2 001 Mar 30
3 002 Jan 30
4 002 Feb 20
5 002 Mar 15
6 003 Jan 10
7 003 Feb 20
8 003 Mar 15
What I want is to calculate the delta items bought for each pair of months. That is, for example, client '001' bought 10 more items in February (20) than in January (10). Client '002', bought -10 items (February 20, January 30). The final DataFrame would look like this:
In [3]: delta_df
Out [3]:
client_id delta_items_feb delta_items_mar
0 001 10 10
1 002 -10 -5
2 003 10 -5
Any thoughts on how to do it?
Upvotes: 4
Views: 3775
Reputation: 643
Easiest way is using Pandas .diff() built in function.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).
delta_dataframe = original_dataframe.diff()
In this case delta_dataframe will give you the change between rows of the original_dataframe.
Upvotes: 0
Reputation: 176840
Here's one way, using pivot_table
to first group the item count by client and month:
(I first cast the items
column to integers with df.items = df.items.astype(int)
)
>>> table = df.pivot_table(values='items', rows='client_id', cols='month')
>>> table = table[['Jan', 'Feb', 'Mar']]
>>> pd.DataFrame(np.diff(table.values),
columns=['delta_items_feb', 'delta_items_mar'],
index=table.index).reset_index()
client_id delta_items_feb delta_items_mar
0 001 10 10
1 002 -10 -5
2 003 10 -5
Note: in newer versions of pandas, use index
/columns
instead of rows
/cols
when creating the pivot table.
This:
np.diff
to calculate the difference between consecutive months and creates a new DataFrame with the desired column namesUpvotes: 1
Reputation: 609
not fancy but here's what works for me
#change 'items' from string to int
## use loc to avoid "slice" warning
df.loc[:,"items"] = df["items"].map(int)
# use pivot to make columns for each unique value in "month" column
dfp = df.pivot('client_id','month','items')
# calculate delta and put in a new column
dfp["dJF"] = dfp.Feb - dfp.Jan
gives
month Feb Jan Mar dJF
client_id
001 20 10 30 10
002 20 30 15 -10
003 20 10 15 10
Upvotes: 1
Reputation: 40973
Kudos for a very clearly formulated question. Group by client and calculate deltas for each group:
>>> df['deltas'] = df.groupby('client_id')\
.apply(lambda x: x['items'].astype(int).diff()).values
client_id items month deltas
0 001 10 Jan NaN
1 001 20 Feb 10
2 001 30 Mar 10
3 002 30 Jan NaN
4 002 20 Feb -10
5 002 15 Mar -5
6 003 10 Jan NaN
7 003 20 Feb 10
8 003 15 Mar -5
Finally take it to the form you want dropping the January column:
>>> df.pivot(index='client_id', columns='month', values='deltas')\
.drop('Jan', axis=1)
month Feb Mar
client_id
001 10 10
002 -10 -5
003 10 -5
Upvotes: 1
Reputation: 40
1) clietn_id to set. Set to list client_listand sorted ['001','002','003'] .
2) month string to int Jan-1;Feb-2;Mar -3 and etc
3) for client in client_listand:
For every client create new list
for line in you_date:
When ides of clients coincide, add to the list #filter by client_id
sorted result by month
in the loop from data of one client generate the lines of outgoing table.
delta_items_mar = item[n]-item[n-1]
Upvotes: -1