Andres
Andres

Reputation: 183

Calculate delta from values in dataframe

I have this DataFrame (this is just an example, not the real data):

In [1]: import pandas as pd
        my_data = [{'client_id' : '001', 'items' : '10', 'month' : 'Jan'},
                   {'client_id' : '001', 'items' : '20', 'month' : 'Feb'},
                   {'client_id' : '001', 'items' : '30', 'month' : 'Mar'},
                   {'client_id' : '002', 'items' : '30', 'month' : 'Jan'},
                   {'client_id' : '002', 'items' : '20', 'month' : 'Feb'},
                   {'client_id' : '002', 'items' : '15', 'month' : 'Mar'},
                   {'client_id' : '003', 'items' : '10', 'month' : 'Jan'},
                   {'client_id' : '003', 'items' : '20', 'month' : 'Feb'},
                   {'client_id' : '003', 'items' : '15', 'month' : 'Mar'}]
        df = pd.DataFrame(my_data)

In  [2]: df
Out [2]:    
            client_id   month        items
         0        001     Jan           10
         1        001     Feb           20
         2        001     Mar           30
         3        002     Jan           30
         4        002     Feb           20
         5        002     Mar           15
         6        003     Jan           10
         7        003     Feb           20
         8        003     Mar           15

What I want is to calculate the delta items bought for each pair of months. That is, for example, client '001' bought 10 more items in February (20) than in January (10). Client '002', bought -10 items (February 20, January 30). The final DataFrame would look like this:

In [3]: delta_df
Out [3]:   
            client_id   delta_items_feb   delta_items_mar
        0         001                10                10
        1         002               -10                -5
        2         003                10                -5

Any thoughts on how to do it?

Upvotes: 4

Views: 3775

Answers (5)

PJ_
PJ_

Reputation: 643

Easiest way is using Pandas .diff() built in function.

Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).

delta_dataframe = original_dataframe.diff()

In this case delta_dataframe will give you the change between rows of the original_dataframe.

Upvotes: 0

Alex Riley
Alex Riley

Reputation: 176840

Here's one way, using pivot_table to first group the item count by client and month:

(I first cast the items column to integers with df.items = df.items.astype(int))

>>> table = df.pivot_table(values='items', rows='client_id', cols='month')
>>> table = table[['Jan', 'Feb', 'Mar']]
>>> pd.DataFrame(np.diff(table.values), 
                 columns=['delta_items_feb', 'delta_items_mar'],
                 index=table.index).reset_index()

  client_id  delta_items_feb  delta_items_mar
0       001               10               10
1       002              -10               -5
2       003               10               -5

Note: in newer versions of pandas, use index/columns instead of rows/cols when creating the pivot table.

This:

  • pivots the data by client and date to show the item counts for each
  • makes sure the columns in the table are sorted properly by month
  • uses np.diff to calculate the difference between consecutive months and creates a new DataFrame with the desired column names

Upvotes: 1

JMann
JMann

Reputation: 609

not fancy but here's what works for me

#change 'items' from string to int
## use loc to avoid "slice" warning
df.loc[:,"items"] = df["items"].map(int)

# use pivot to make columns for each unique value in "month" column
dfp = df.pivot('client_id','month','items')

# calculate delta and put in a new column 
dfp["dJF"] = dfp.Feb - dfp.Jan

gives

month     Feb Jan Mar  dJF
client_id                 
001        20  10  30   10
002        20  30  15  -10
003        20  10  15   10

Upvotes: 1

elyase
elyase

Reputation: 40973

Kudos for a very clearly formulated question. Group by client and calculate deltas for each group:

>>> df['deltas'] = df.groupby('client_id')\
                     .apply(lambda x: x['items'].astype(int).diff()).values

  client_id  items month  deltas
0       001     10   Jan     NaN
1       001     20   Feb      10
2       001     30   Mar      10
3       002     30   Jan     NaN
4       002     20   Feb     -10
5       002     15   Mar      -5
6       003     10   Jan     NaN
7       003     20   Feb      10
8       003     15   Mar      -5

Finally take it to the form you want dropping the January column:

>>> df.pivot(index='client_id', columns='month', values='deltas')\
      .drop('Jan', axis=1)

month       Feb  Mar
client_id       
001         10  10
002        -10  -5
003         10  -5

Upvotes: 1

Evgeny Ivanov
Evgeny Ivanov

Reputation: 40

1) clietn_id to  set. Set to list client_listand sorted  ['001','002','003'] .
2) month string to int Jan-1;Feb-2;Mar -3 and etc
3)  for client in client_listand:
    For every client create new list
    for line in you_date:
        When ides of clients coincide, add to the list #filter by client_id
     sorted result by month
     in the loop from data of one client generate the lines of outgoing table. 
     delta_items_mar = item[n]-item[n-1]

Upvotes: -1

Related Questions