mrgou

Reputation: 2530

A fast, efficient way to calculate time differences between groups of rows in pandas?

Let's say I have this table in a DataFrame, with the dates several cars have been refilled:

+-------+-------------+
| carId | refill_date |
+-------+-------------+
|     1 |  2020-03-01 |
+-------+-------------+
|     1 |  2020-03-12 |
+-------+-------------+
|     1 |  2020-04-04 |
+-------+-------------+
|     2 |  2020-03-07 |
+-------+-------------+
|     2 |  2020-03-26 |
+-------+-------------+
|     2 |  2020-04-01 |
+-------+-------------+

I would like to add a 3rd column, time_elapsed, with the duration between each refill.

+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
|     1 |  2020-03-01 |              |
+-------+-------------+--------------+
|     1 |  2020-03-12 |           11 |
+-------+-------------+--------------+
|     1 |  2020-04-04 |           23 |
+-------+-------------+--------------+
|     2 |  2020-03-07 |              |
+-------+-------------+--------------+
|     2 |  2020-03-26 |           19 |
+-------+-------------+--------------+
|     2 |  2020-04-01 |            6 |
+-------+-------------+--------------+

So here's what I do:

import pandas as pd

data = [
    {
        "carId": 1,
        "refill_date": "2020-3-1"
    },
    {
        "carId": 1,
        "refill_date": "2020-3-12"
    },
    {
        "carId": 1,
        "refill_date": "2020-4-4"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-7"
    },
    {
        "carId": 2,
        "refill_date": "2020-3-26"
    },
    {
        "carId": 2,
        "refill_date": "2020-4-1"
    }
]

df = pd.DataFrame(data)

df['refill_date'] = pd.to_datetime(df['refill_date'])

for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                      'refill_date'].diff()

Which returns the expected results:

+---+-------+-------------+--------------+
|   | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 |     1 |  2020-03-01 |          NaT |
+---+-------+-------------+--------------+
| 1 |     1 |  2020-03-12 |      11 days |
+---+-------+-------------+--------------+
| 2 |     1 |  2020-04-04 |      23 days |
+---+-------+-------------+--------------+
| 3 |     2 |  2020-03-07 |          NaT |
+---+-------+-------------+--------------+
| 4 |     2 |  2020-03-26 |      19 days |
+---+-------+-------------+--------------+
| 5 |     2 |  2020-04-01 |       6 days |
+---+-------+-------------+--------------+

So, everything looks OK, but here's the catch: in my real-life instance, my dataframe contains 3.5 million rows, and processing takes ages, even though it's a completely numerical, in-memory calculation with "only" 1711 groups to loop through.

Is there an alternative and faster way?

Thanks!

Upvotes: 1

Views: 149

Answers (3)

Adam.Er8

Reputation: 13413

Using native pandas methods on a df.groupby should give a significant performance boost over a "native Python" loop:

df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

Here's a small benchmark (on my laptop, YMMV...) using 99 cars with 31 days each, showing an almost 10x performance boost:

import pandas as pd
import timeit

data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)

print(time1)
print(time2)
print(time1/time2)

Output:

16.6183732
1.7910263000000022
9.278687420726307
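
Two details are worth spelling out, as a hedged sketch rather than part of the benchmark: groupby(...).diff() compares each row with the previous row of the same car in the existing row order, so the rows need to be sorted by date within each car, and the result is a timedelta column, whereas the question's desired table shows plain day counts. The shuffled sample data and the days_elapsed column name below are assumptions made for illustration.

```python
import pandas as pd

# Same refills as in the question, deliberately out of order
df = pd.DataFrame({
    "carId": [2, 1, 1, 2, 1, 2],
    "refill_date": ["2020-03-26", "2020-03-01", "2020-04-04",
                    "2020-03-07", "2020-03-12", "2020-04-01"],
})
df["refill_date"] = pd.to_datetime(df["refill_date"])

# diff() works on consecutive rows within each group, so sort first
df = df.sort_values(["carId", "refill_date"]).reset_index(drop=True)

# Timedelta between consecutive refills of the same car (NaT for the first)
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

# Whole days as numbers; the first refill of each car becomes NaN,
# so the column is float rather than int
df["days_elapsed"] = df["time_elapsed"].dt.days

print(df)
```

If the frame is already sorted by date within each car, the sort_values step can be skipped.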

Upvotes: 4

Yevhen Kuzmovych

Reputation: 12140

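You just need to use .groupby, selecting the date column so that diff returns a single Series:

df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

Output (values of the new time_elapsed column):

0         NaT
1     11 days
2     23 days
3         NaT
4     19 days
5      6 days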
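
A short sketch of why selecting the column matters: without a column selection, groupby('carId').diff() diffs every non-key column and returns a DataFrame, which can only be assigned to a single column when exactly one column is left. The litres column below is hypothetical and exists only to show the difference.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
    # Hypothetical extra column, not in the original question
    "litres": [40.0, 35.5, 42.0, 30.0, 33.0, 31.5],
})

# No column selected: every non-key column is diffed, result is a DataFrame
all_cols = df.groupby("carId").diff()

# Column selected: result is a single Series, safe to assign to one column
one_col = df.groupby("carId")["refill_date"].diff()
df["time_elapsed"] = one_col

print(list(all_cols.columns))
print(type(one_col).__name__)
```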

Upvotes: 2

sammywemmy

Reputation: 28729

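Get time_elapsed by shifting refill_date within each car and subtracting the shifted (previous) date from the current one:

(
    df.assign(
        # make sure the column is a datetime before doing date arithmetic
        refill_date=pd.to_datetime(df.refill_date),
        # previous refill date of the same car (NaT for the first refill)
        time_shift=lambda x: x.groupby("carId").refill_date.shift(),
        # current date minus previous date, so no abs() is needed
        time_elapsed=lambda x: x.refill_date.sub(x.time_shift),
    )
)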

The other answers using diff are better: they are more succinct and, I'd like to believe, faster.
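
To check that the shift-based approach and diff agree, here is a minimal sketch on the question's sample data; the variable names via_shift and via_diff are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
})

by_car = df.groupby("carId")["refill_date"]

# Current date minus the previous date of the same car
via_shift = df["refill_date"] - by_car.shift()

# The same computation expressed with diff()
via_diff = by_car.diff()

# equals() treats missing values in the same positions as equal
same = via_shift.equals(via_diff)
print(same)
```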

Upvotes: 1
