David Phooter
David Phooter

Reputation: 31

How to use the diff() function in pandas but enter the difference values in a new column?

I have a dataframe df:

df

      x-value 
frame 
    1     15  
    2     20 
    3     19

How can I get:

df

      x-value   delta-x 
frame 
    1     15      0
    2     20      5
    3     19     -1

Upvotes: 1

Views: 1802

Answers (1)

piRSquared
piRSquared

Reputation: 294318

Not to say there is anything wrong with what @Wen posted as a comment, but I want to post a more complete answer.

The Problem
There are 3 things going on that need to be addressed:

  1. Calculating the values that are the differences from one row to the next.
  2. Handling the fact that the "difference" will be one less value than the original length of the dataframe and we'll have to fill in a value for the missing bit.
  3. How do we assign this to a new column.

Option #1
The most natural way to do the diff would be to use pd.Series.diff (as @Wen suggested). But in order to produce the stated results, which are integers, I recommend using the pd.Series.fillna parameter, downcast='infer'. Finally, I don't like editing the dataframe unless there is a need for it. So I use pd.DataFrame.assign to produce a new dataframe that is a copy of the old one with a new column associated.

df.assign(**{'delta-x': df['x-value'].diff().fillna(0, downcast='infer')})

       x-value  delta-x
frame                  
1           15        0
2           20        5
3           19       -1

Option #2

Similar to #1 but I'll use numpy.diff to preserve int type in addition to picking up some performance.

df.assign(**{'delta-x': np.append(0, np.diff(df['x-value'].values))})

       x-value  delta-x
frame                  
1           15        0
2           20        5
3           19       -1

Testing

pir1 = lambda d: d.assign(**{'delta-x': d['x-value'].diff().fillna(0, downcast='infer')})
pir2 = lambda d: d.assign(**{'delta-x': np.append(0, np.diff(d['x-value'].values))})

res = pd.DataFrame(
    index=[10, 300, 1000, 3000, 10000, 30000],
    columns=['pir1', 'pir2'], dtype=float)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=1000)

res.plot(loglog=True)

res.div(res.min(1), 0)

enter image description here

           pir1  pir2
10     2.069498   1.0
300    2.123017   1.0
1000   2.397373   1.0
3000   2.804214   1.0
10000  4.559525   1.0
30000  7.058344   1.0

Upvotes: 4

Related Questions