Pandas Improving Efficiency

Question

I have a pandas dataframe with approximately 3 million rows. I want to partially aggregate the last column in seperate spots based on another variable.

My solution was to separate the dataframe rows into a list of new dataframes based on that variable, aggregate the dataframes, and then join them again into a single dataframe. The problem is that after a few 10s of thousands of rows, I get a memory error. What methods can I use to improve the efficiency of my function to prevent these memory errors?

An example of my code is below

test = pd.DataFrame({"unneeded_var": [6,6,6,4,2,6,9,2,3,3,1,4,1,5,9],
                     "year": [0,0,0,0,1,1,1,2,2,2,2,3,3,3,3], 
                     "month" : [0,0,0,0,1,1,1,2,2,2,3,3,3,4,4],
                     "day" : [0,0,0,1,1,1,2,2,2,2,3,3,4,4,5], 
                     "day_count" : [7,4,3,2,1,5,4,2,3,2,5,3,2,1,3]})
test = test[["year", "month", "day", "day_count"]]

def agg_multiple(df, labels, aggvar, repl=None):
    if(repl is None): repl = aggvar
    conds = df.duplicated(labels).tolist() #returns boolean list of false for a unique (year,month) then true until next unique pair
    groups = []
    start = 0
    for i in range(len(conds)): #When false, split previous to new df, aggregate count 
        bul = conds[i]
        if(i == len(conds) - 1): i +=1 #no false marking end of last group, special case
        if not bul and i > 0 or bul and i == len(conds): 
            sample = df.iloc[start:i , :]
            start = i
            sample = sample.groupby(labels, as_index=False).agg({aggvar:sum}).rename(columns={aggvar : repl})
            groups.append(sample)
    df = pd.concat(groups).reset_index(drop=True) #combine aggregated dfs into new df
    return df



test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")

I suppose that I could potentially apply the function to small samples of the dataframe, to prevent a memory error and then combine those, but I'd rather improve the computation time of my function.

sacuL · Accepted Answer

There are almost always pandas methods that are pretty optimized for tasks that will vastly outperform iteration through the dataframe. If I understand correctly, in your case, the following will return the same exact output as your function:

test2 = (test.groupby(['year', 'month'])
         .day_count.sum()
         .to_frame('month_count')
         .reset_index())

>>> test2
   year  month  month_count
0     0      0           16
1     1      1           10
2     2      2            7
3     2      3            5
4     3      3            5
5     3      4            4

To check that it's the same:

# Your original function:
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")

>>> test == test2
   year  month  month_count
0  True   True         True
1  True   True         True
2  True   True         True
3  True   True         True
4  True   True         True
5  True   True         True

Pandas Improving Efficiency

Answers (2)

Related Questions