Stat_prob_001
Stat_prob_001

Reputation: 181

Alternative of for loop on a column subsetting in pandas

I am stuck at a time consumption related problem in pandas:

The code looks like this:

df = pd.DataFrame({"IDs": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "Month": ["01", "02", "01", "01", "02", "01", "01", "02", "01"],
                   "column1": [0.9, 0.5, 0.3, 0.8, 0.5, 0.1, 0.6, 0.2, 0.8]})

df_list = []
for id in df.IDs.unique():
    temp = df[df.IDs == id]
    temp = temp.groupby("Month").mean()
    temp2 = temp['column1'].ewm(span=3, adjust=True).sum()
    df_list.append(temp2)

Note that unique IDs contains around 500k elements, the original dataframe df contains around 6 mil records.

Now I used tqdm to check the estimated time and it's taking 14-15 hrs to complete. If I have even loop only for temp = df[df.IDs == id] line and the estimated time is same (basically those are pandas function, so it should not make any performance issue). So the problem lies in that line.

Is there any alternative way to do this? Thanks for any suggestion.

Upvotes: 2

Views: 115

Answers (1)

BeRT2me
BeRT2me

Reputation: 13242

out = (df.groupby(['IDs', 'Month'])
         .mean()['column1']
         .ewm(span=3, adjust=True)
         .sum())
print(out.reset_index())

Output:

   IDs Month  column1
0    1    01  0.60000
1    1    02  0.80000
2    2    01  0.85000
3    2    02  0.92500
4    3    01  1.16250
5    3    02  0.78125

out = (df.groupby('IDs')
         .apply(lambda x: x.groupby('Month')
                           .mean()['column1']
                           .ewm(span=3, adjust=True)
                           .sum())
         .stack())
print(out.reset_index(name='column1'))

Output:

   IDs Month  column1
0    1    01    0.600
1    1    02    0.800
2    2    01    0.450
3    2    02    0.725
4    3    01    0.700
5    3    02    0.550

Upvotes: 1

Related Questions