Romain Sacchi
Romain Sacchi

Reputation: 39

Pandas Dataframe rows average based on value interval

I have a DataFrame that contains a list of thousands of, let's say, cars. Each car has a Start year column and an End year column, which represents the interval of years the car has circulated. Each car also has an average fuel consumption column for the whole period, like so:

df_cars
+-----+------------+----------+--------------------------+
| Car | Start year | End year | Average fuel consumption |
+-----+------------+----------+--------------------------+
| 1   | 2002       | 2025     | 10.0                     |
+-----+------------+----------+--------------------------+
| 2   | 1995       | 2008     | 12.5                     |
+-----+------------+----------+--------------------------+
| 3   | 2005       | 2017     | 8.5                      |
+-----+------------+----------+--------------------------+

For a period between 2000 and 2015, I would like to obtain an average fuel consumption of the fleet of cars operating each year "Y". Hence, if a car has a Start year < Y and an End year > Y, it should be included in the average of the fleet for that given year. Obviously, most cars will appear in the fleet average of several years.

As of now, I am doing a loop, but it is rather slow.

for y in range(2000, 2015):
    df_cars[(df_cars["Start year"]<=int(y))&(df_cars["End year"]>=int(y))]["Average fuel consumption"].mean(axis=0)

Is there a faster way of doing this? Thank you.

Upvotes: 1

Views: 1187

Answers (1)

jpp
jpp

Reputation: 164623

Here is one way using collections.Counter.

It is not vectorised, but still appears to offer a 13x performance improvement.

There is a pandonic but still loopy alternative here, but I cannot find a vectorised implementation.

import pandas as pd
from collections import Counter

df = pd.DataFrame([[1, 2002, 2025, 10.0],
                   [2, 1995, 2008, 12.5],
                   [3, 2005, 2017, 8.5]],
                  columns=['Car', 'StartYear', 'EndYear', 'AvgFuelConsumption'])

def jp(df):

    # first get range of years
    year_range = range(df['StartYear'].min(), df['EndYear'].max()+1)

    res = pd.DataFrame(index=year_range, columns=['AvgFuelConsumption'])

    # use collections.Counter for sums and counts
    c_sum = Counter()
    c_count = Counter()
    for idx, car, start, end, fuel in df.itertuples():
        for i in range(start, end+1):
            c_sum[i] += fuel
            c_count[i] += 1

    # calculate averages by year
    c_res = {y: c_sum[y] / c_count[y] for y in c_sum}

    # create dataframe from dictionary
    res = pd.DataFrame.from_dict(c_res, orient='index')

    return res

def original(df):

    res = pd.DataFrame(index=range(2000, 2026), columns=['AvgFuelConsumption'])

    for y in range(2000, 2026):
        res.loc[y, 'AvgFuelConsumption'] = df[(df["StartYear"]<=int(y))&(df["EndYear"]>=int(y))]["AvgFuelConsumption"].mean(axis=0)

    return res

%timeit jp(df)        # 4.17ms
%timeit original(df)  # 54.8ms

Upvotes: 1

Related Questions