Reputation: 2530
Let's say I have this table in a DataFrame, with the dates on which several cars were refilled:
+-------+-------------+
| carId | refill_date |
+-------+-------------+
| 1 | 2020-03-01 |
+-------+-------------+
| 1 | 2020-03-12 |
+-------+-------------+
| 1 | 2020-04-04 |
+-------+-------------+
| 2 | 2020-03-07 |
+-------+-------------+
| 2 | 2020-03-26 |
+-------+-------------+
| 2 | 2020-04-01 |
+-------+-------------+
I would like to add a third column, time_elapsed, with the time elapsed since the same car's previous refill:
+-------+-------------+--------------+
| carId | refill_date | time_elapsed |
+-------+-------------+--------------+
| 1 | 2020-03-01 | |
+-------+-------------+--------------+
| 1 | 2020-03-12 | 11 |
+-------+-------------+--------------+
| 1 | 2020-04-04 | 23 |
+-------+-------------+--------------+
| 2 | 2020-03-07 | |
+-------+-------------+--------------+
| 2 | 2020-03-26 | 19 |
+-------+-------------+--------------+
| 2 | 2020-04-01 | 6 |
+-------+-------------+--------------+
So here's what I do:
import pandas as pd

data = [
    {"carId": 1, "refill_date": "2020-3-1"},
    {"carId": 1, "refill_date": "2020-3-12"},
    {"carId": 1, "refill_date": "2020-4-4"},
    {"carId": 2, "refill_date": "2020-3-7"},
    {"carId": 2, "refill_date": "2020-3-26"},
    {"carId": 2, "refill_date": "2020-4-1"},
]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

# For each car, select its rows and diff the dates within that selection
for c in df['carId'].unique():
    df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                      'refill_date'].diff()
Which returns the expected results:
+---+-------+-------------+--------------+
| | carId | refill_date | time_elapsed |
+---+-------+-------------+--------------+
| 0 | 1 | 2020-03-01 | NaT |
+---+-------+-------------+--------------+
| 1 | 1 | 2020-03-12 | 11 days |
+---+-------+-------------+--------------+
| 2 | 1 | 2020-04-04 | 23 days |
+---+-------+-------------+--------------+
| 3 | 2 | 2020-03-07 | NaT |
+---+-------+-------------+--------------+
| 4 | 2 | 2020-03-26 | 19 days |
+---+-------+-------------+--------------+
| 5 | 2 | 2020-04-01 | 6 days |
+---+-------+-------------+--------------+
So, everything looks OK, but here's the catch: in my real-life instance, my dataframe contains 3.5 million rows, and processing takes ages, even though it's a completely numerical, in-memory calculation with "only" 1711 groups to loop through.
Is there an alternative and faster way?
Thanks!
Upvotes: 1
Views: 149
Reputation: 13413
Using native pandas methods on a df.groupby should give a significant performance boost over a "native Python" loop:
df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
Here's a small benchmark (on my laptop, YMMV...) using 99 cars with 31 days each, showing a roughly 9x performance boost:
import pandas as pd
import timeit
data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])
def original_method():
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)
print(time1)
print(time2)
print(time1/time2)
Output:
16.6183732
1.7910263000000022
9.278687420726307
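Two details the benchmark does not cover, sketched here under assumptions: diff() compares each row with the previous row of the same group, so it only yields "time since the last refill" if each car's rows are in chronological order, and the question's desired output shows plain day counts rather than Timedelta values. The sample rows are deliberately shuffled, and the days_elapsed column name is my own, not something from the question.

```python
import pandas as pd

# Same refills as in the question, but deliberately out of order
df = pd.DataFrame({
    "carId": [2, 1, 1, 2, 1, 2],
    "refill_date": ["2020-03-26", "2020-03-01", "2020-04-04",
                    "2020-03-07", "2020-03-12", "2020-04-01"],
})
df["refill_date"] = pd.to_datetime(df["refill_date"])

# Sort so that, within each car, rows follow each other chronologically
df = df.sort_values(["carId", "refill_date"]).reset_index(drop=True)

# One vectorized pass over all groups instead of one pass per car
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

# Optional: whole days as numbers (NaN for each car's first refill);
# "days_elapsed" is a hypothetical column name
df["days_elapsed"] = df["time_elapsed"].dt.days

print(df)
```

If the data is already sorted by date within each car, as in the question's example, the sort_values step can be skipped.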
Upvotes: 4
Reputation: 12140
You just need to use .groupby:
df['time_elapsed'] = df.groupby('carId').diff()
output:
refill_date
0 NaT
1 11 days
2 23 days
3 NaT
4 19 days
5 6 days
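One caveat, as far as I can tell: calling diff() on the whole groupby returns a DataFrame with one column per non-grouping column, so the assignment above relies on refill_date being the only other column. The sketch below adds a made-up liters column to show the difference; selecting the column first keeps the result a Series regardless of what else is in the frame.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
    # Hypothetical extra column, not part of the question's data
    "liters": [40.0, 35.5, 42.0, 30.0, 28.5, 33.0],
})

# Diffs every non-grouping column: a DataFrame with refill_date and liters
all_diffs = df.groupby("carId").diff()

# Diffs only the date column: a Series that fits a single new column
date_diffs = df.groupby("carId")["refill_date"].diff()
df["time_elapsed"] = date_diffs

print(all_diffs.columns.tolist())
print(df)
```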
Upvotes: 2
Reputation: 28729
Get the time_elapsed by using shift and subtracting from refill_date:
(
    df.assign(
        refill_date=pd.to_datetime(df.refill_date),
        # previous refill date of the same car (NaT for the first refill)
        time_shift=lambda x: x.groupby("carId").refill_date.shift(),
        # previous minus current is negative, hence the abs()
        time_elapsed=lambda x: x.time_shift.sub(x.refill_date).abs(),
    )
)
The other answers using diff are better: they are more succinct and, I'd like to believe, faster.
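To check that the two approaches agree, here is a small sketch on the question's sample data. It subtracts in the "current minus previous" direction, so no abs() is needed; the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2, 2],
    "refill_date": pd.to_datetime([
        "2020-03-01", "2020-03-12", "2020-04-04",
        "2020-03-07", "2020-03-26", "2020-04-01",
    ]),
})

grouped = df.groupby("carId")["refill_date"]

# shift() moves each car's dates down one row, so subtracting it from the
# original column gives current refill minus previous refill
via_shift = df["refill_date"] - grouped.shift()

# diff() does the same subtraction in one call
via_diff = grouped.diff()

same = via_shift.equals(via_diff)
print(same)
```

On this data both Series hold identical values, with NaT on each car's first refill.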
Upvotes: 1