Reputation: 965
I have two large data sets, and I can't do the aggregations by combining the two dataframes. I have to do the aggregation on df_train first, then map the values to df_test.
df_train and df_test have exactly the same id1 and id2 values, but df_test has more samples. I'm computing the target mean per id1 and id2 group and storing it as a dictionary because of memory issues.
target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().to_dict()
This is the output of the aggregation. The keys are tuple pairs with id1 as the first element and id2 as the second, and the values are the target means of the groups.
{(0, 0): 146.45497131347656,
(1, 0): 74.86539459228516,
(2, 0): 14.551384925842285,
(3, 0): 235.5499725341797,
(4, 0): 976.5567626953125,
(5, 0): 17.894445419311523,
(6, 0): 64.06660461425781,
(7, 0): 350.33416748046875,
(7, 1): 3097.043701171875,
(8, 0): 256.92779541015625,
(9, 0): 72.7147445678711}
How can I map those values to the id1 and id2 columns properly?
(There are 60 million rows of data, 1449 id1 values and 4 id2 values, so speed is important.)
EDIT:
df_train[['id1', 'id2']].map(target_mean)
I tried this, but map is only supported by pd.Series.
Upvotes: 2
Views: 881
Reputation: 862661
I think it is better to use DataFrame.join here:
target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().rename('avg')
df_test = df_test.join(target_mean, on=['id1', 'id2'])
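For example, with small made-up frames (the column names follow the question, the values are only for illustration):
import pandas as pd

# toy frames standing in for the real 60 million row data
df_train = pd.DataFrame({'id1': [0, 0, 1, 1],
                         'id2': [0, 0, 0, 1],
                         'target': [10.0, 20.0, 5.0, 7.0]})
df_test = pd.DataFrame({'id1': [0, 1, 1, 2],
                        'id2': [0, 0, 1, 0]})

# per-group mean kept as a Series with a (id1, id2) MultiIndex
target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().rename('avg')

# join aligns on that MultiIndex; rows of df_test whose (id1, id2)
# pair is missing from df_train get NaN in avg
df_test = df_test.join(target_mean, on=['id1', 'id2'])
print(df_test)
#    id1  id2   avg
# 0    0    0  15.0
# 1    1    0   5.0
# 2    1    1   7.0
# 3    2    0   NaN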
Your solution is possible, but I guess it is slower with map by MultiIndex:
target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().to_dict()
df_test['avg'] = df_test.set_index(['id1', 'id2']).index.map(target_mean)
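In both variants, (id1, id2) pairs that exist only in df_test end up with NaN in avg. If you need a fallback value there, a sketch like this should work (using the overall training mean as the fallback is only an assumption; the question does not say what to do with unmatched groups):
# replace NaN for groups missing from df_train with the global training mean
df_test['avg'] = df_test['avg'].fillna(df_train['target'].mean())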
Upvotes: 3