Jivan

Reputation: 23068

Pandas — match last identical row and compute difference

With a DataFrame like the following:

         timestamp    value
0       2012-01-01      3.0
1       2012-01-05      3.0
2       2012-01-06      6.0
3       2012-01-09      3.0
4       2012-01-31      1.0
5       2012-02-09      3.0
6       2012-02-11      1.0
7       2012-02-13      3.0
8       2012-02-15      2.0
9       2012-02-18      5.0

What would be an elegant and efficient way to add a time_since_last_identical column, so that the previous example would result in:

         timestamp    value   time_since_last_identical
0       2012-01-01      3.0                         NaT
1       2012-01-05      3.0                      5 days
2       2012-01-06      6.0                         NaT
3       2012-01-09      3.0                      4 days
4       2012-01-31      1.0                         NaT
5       2012-02-09      3.0                     31 days
6       2012-02-11      1.0                     10 days
7       2012-02-13      3.0                      4 days
8       2012-02-15      2.0                         NaT
9       2012-02-18      5.0                         NaT

The important part of the problem is not necessarily the usage of time delays. Any solution that matches one particular row with the previous row of identical value, and computes something out of those two rows (here, a difference) will be valid.

Note: not interested in apply or loop-based approaches.

Upvotes: 1

Views: 41

Answers (2)

FLab

Reputation: 7476

Here is a solution using pandas groupby:

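# Assumes `import pandas as pd` and the question's DataFrame as `df`.
out = (
    df.groupby(df['value'])
      # Per group: parse the dates and take consecutive differences.
      .apply(lambda x: pd.to_datetime(x['timestamp'], format="%Y-%m-%d").diff())
      # Move the group key (value) from the index back into a column.
      .reset_index(level=0, drop=False)
      # Restore the original row order.
      .reindex(df.index)
      .rename(columns={'timestamp': 'time_since_last_identical'})
)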

out = pd.concat([df['timestamp'], out], axis = 1)

That gives the following output:

    timestamp  value  time_since_last_identical
0  2012-01-01    3.0                        NaT
1  2012-01-05    3.0                     4 days
2  2012-01-06    6.0                        NaT
3  2012-01-09    3.0                     4 days
4  2012-01-31    1.0                        NaT
5  2012-02-09    3.0                    31 days
6  2012-02-11    1.0                    11 days
7  2012-02-13    3.0                     4 days
8  2012-02-15    2.0                        NaT
9  2012-02-18    5.0                        NaT

It does not exactly match your desired output, but I guess it is a matter of conventions (e.g. whether to include current day or not). Happy to refine if you provide more details.
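The two rows that differ from the desired output can be checked with plain timestamp subtraction. This is only a small sketch, assuming the timestamps are whole calendar dates as in the question; the variable names are illustrative:

```python
import pandas as pd

# Exclusive difference between consecutive rows that share the same value.
first_gap = pd.Timestamp("2012-01-05") - pd.Timestamp("2012-01-01")   # rows 0 and 1, value 3.0
second_gap = pd.Timestamp("2012-02-11") - pd.Timestamp("2012-01-31")  # rows 4 and 6, value 1.0

print(first_gap.days, second_gap.days)  # prints: 4 11
```

Under this convention the gaps are 4 and 11 days, which is what the groupby output reports.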

Upvotes: 0

VinceP

Reputation: 2163

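A simple, clean and elegant groupby will do the trick, provided timestamp is already a datetime column (convert it with pd.to_datetime otherwise). Selecting the column explicitly makes diff return a Series that can be assigned directly:

df['time_since_last_identical'] = df.groupby('value')['timestamp'].diff()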

Gives:

   timestamp  value  time_since_last_identical
0 2012-01-01    3.0                        NaT
1 2012-01-05    3.0                     4 days
2 2012-01-06    6.0                        NaT
3 2012-01-09    3.0                     4 days
4 2012-01-31    1.0                        NaT
5 2012-02-09    3.0                    31 days
6 2012-02-11    1.0                    11 days
7 2012-02-13    3.0                     4 days
8 2012-02-15    2.0                        NaT
9 2012-02-18    5.0                        NaT
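
The question also asks for a way to compute something other than a difference from each row and its previous identical row. One possible sketch, assuming the sample data from the question, uses groupby(...).shift() to put the previous matching timestamp next to each row, so that any vectorized operation can then be applied to the pair. The prev_timestamp column name is introduced here only for illustration:

```python
import pandas as pd

# Sample data from the question, with timestamps parsed as datetimes.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-01-01", "2012-01-05", "2012-01-06", "2012-01-09", "2012-01-31",
        "2012-02-09", "2012-02-11", "2012-02-13", "2012-02-15", "2012-02-18",
    ]),
    "value": [3.0, 3.0, 6.0, 3.0, 1.0, 3.0, 1.0, 3.0, 2.0, 5.0],
})

# Timestamp of the previous row with the same value (NaT if there is none).
df["prev_timestamp"] = df.groupby("value")["timestamp"].shift()

# Any vectorized computation on the pair works; here, a plain difference.
df["time_since_last_identical"] = df["timestamp"] - df["prev_timestamp"]

print(df)
```

For a plain difference this gives the same result as diff(); the shifted column is only worth keeping when the computation needs both rows explicitly.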

Upvotes: 2
