David
David

Reputation: 1256

Pandas: Difference to previous value

Given a Pandas Data Frame that looks like this

GROUP   VALUE    MASK
  1        5     false
  2       10     false
  2       20     false
  1        7     true
  3        17    false
  3        18    false
  1        100   false
  1        200   true

For every row where the MASK is true I would like to get the difference to the previous value within this group, i.e., the result should be

VALUE DIFF
  7      2
  200  100

How can I compute this in Pandas?

How can I compute this fast for a large data frame with about 2 million rows and 1 million groups?

Upvotes: 7

Views: 1794

Answers (2)

TurtleIzzy
TurtleIzzy

Reputation: 1047

Actually the bottleneck is groupby. You don't actually need to use groupby for this specific problem. To sort the dataframe by GROUP, perform diff on the sorted dataframe and filter by MASK should be okay. We must use kind='mergesort' to keep the order unchanged within the group before and after sorting,

Assume MASK is always False for the first element of each group ( since the first element is meaningless for difference operation ), you can use this

pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]

Performance Tests:

MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS), "VALUE": np.ceil(np.random.rand(MAXN)*10000), "MASK":np.floor(np.random.rand(MAXN)*2).astype("bool")})

%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop

%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop

#MAXN = 2000000
#GROUPS = 1000000
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop

Upvotes: 3

piRSquared
piRSquared

Reputation: 294318

use groupby, diff, 'MASK'

pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()],
          axis=1, keys=['VALUE', 'DIFF'])[df.MASK]

enter image description here

Upvotes: 1

Related Questions