gunesevitan
gunesevitan

Reputation: 965

Pandas - Mapping dict with multiple index to column

I have two large data sets which I can't do the aggregations by combining two dataframes. I have to do the aggregation on df_train first, then map the values to the df_test.

df_train and df_test have the same exact id1 and id2, but the df_test have more samples. I'm computing the target mean on id1 and id2 and store it as a dictionary for memory issues.

target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().to_dict()

This is the output of the aggregation. The keys are tuple pairs with id1 as the first element and id2 as the second element, and the values are target means of the groups.

{(0, 0): 146.45497131347656,
 (1, 0): 74.86539459228516,
 (2, 0): 14.551384925842285,
 (3, 0): 235.5499725341797,
 (4, 0): 976.5567626953125,
 (5, 0): 17.894445419311523,
 (6, 0): 64.06660461425781,
 (7, 0): 350.33416748046875,
 (7, 1): 3097.043701171875,
 (8, 0): 256.92779541015625,
 (9, 0): 72.7147445678711 }

How can I map those values to id1 and id2 columns properly?

(There are 60 million rows of data, 1449 id1 and 4 id2 values, so speed is important)

EDIT:

df_train[['id1', 'id2']].map(target_mean)

I tried this, but map is only supported by pd.Series.

Upvotes: 2

Views: 881

Answers (1)

jezrael
jezrael

Reputation: 862661

I think better is use DataFrame.join here:

target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().rename('avg')

df_test = df_test.join(target_mean, on=['id1', 'id2'])

Your solution is possible, but I guess slowier with map by MultiIndex:

target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().to_dict()
df_test['avg'] = df_test.set_index(['id1', 'id2']).index.map(target_mean)

Upvotes: 3

Related Questions