Reputation: 39
I have a DataFrame that contains a list of thousands of, let's say, cars. Each car has a Start year column and an End year column, which represents the interval of years the car has circulated. Each car also has an average fuel consumption column for the whole period, like so:
df_cars
+-----+------------+----------+--------------------------+
| Car | Start year | End year | Average fuel consumption |
+-----+------------+----------+--------------------------+
| 1 | 2002 | 2025 | 10.0 |
+-----+------------+----------+--------------------------+
| 2 | 1995 | 2008 | 12.5 |
+-----+------------+----------+--------------------------+
| 3 | 2005 | 2017 | 8.5 |
+-----+------------+----------+--------------------------+
For a period between 2000 and 2015, I would like to obtain an average fuel consumption of the fleet of cars operating each year "Y". Hence, if a car has a Start year < Y and an End year > Y, it should be included in the average of the fleet for that given year. Obviously, most cars will appear in the fleet average of several years.
As of now, I am doing a loop, but it is rather slow.
for y in range(2000, 2015):
df_cars[(df_cars["Start year"]<=int(y))&(df_cars["End year"]>=int(y))]["Average fuel consumption"].mean(axis=0)
Is there a faster way of doing this? Thank you.
Upvotes: 1
Views: 1187
Reputation: 164623
Here is one way using collections.Counter
.
It is not vectorised, but still appears to offer a 13x performance improvement.
There is a pandonic but still loopy alternative here, but I cannot find a vectorised implementation.
import pandas as pd
from collections import Counter
df = pd.DataFrame([[1, 2002, 2025, 10.0],
[2, 1995, 2008, 12.5],
[3, 2005, 2017, 8.5]],
columns=['Car', 'StartYear', 'EndYear', 'AvgFuelConsumption'])
def jp(df):
# first get range of years
year_range = range(df['StartYear'].min(), df['EndYear'].max()+1)
res = pd.DataFrame(index=year_range, columns=['AvgFuelConsumption'])
# use collections.Counter for sums and counts
c_sum = Counter()
c_count = Counter()
for idx, car, start, end, fuel in df.itertuples():
for i in range(start, end+1):
c_sum[i] += fuel
c_count[i] += 1
# calculate averages by year
c_res = {y: c_sum[y] / c_count[y] for y in c_sum}
# create dataframe from dictionary
res = pd.DataFrame.from_dict(c_res, orient='index')
return res
def original(df):
res = pd.DataFrame(index=range(2000, 2026), columns=['AvgFuelConsumption'])
for y in range(2000, 2026):
res.loc[y, 'AvgFuelConsumption'] = df[(df["StartYear"]<=int(y))&(df["EndYear"]>=int(y))]["AvgFuelConsumption"].mean(axis=0)
return res
%timeit jp(df) # 4.17ms
%timeit original(df) # 54.8ms
Upvotes: 1