Reputation: 214
I have a data frame (df1) containing no more than 10 rows. It has two reference columns (Val1, Val2) and a third column (Item) that holds an important string. I have a second data frame that contains the same reference columns (Val1, Val2) but more than 5 million rows. I want to map Item from the first data frame onto the second as efficiently as possible.
I've tried merge, and also setting the index and using join. Dask did not speed up the process either. Is there a separate method I can use instead of merging/joining the two data frames?
import pandas as pd
df1 = pd.DataFrame({
'Val1' : [1.0,1.0,2.0,2.0],
'Val2' : ['Red','Blue','Red','Blue'],
'Item' : ['Up','Down','Down','Up'],
})
df2 = pd.DataFrame({
'Val1' : [1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0],
'Val2' : ['Red','Blue','Red','Blue','Red','Blue','Red','Blue'],
})
df1.set_index(['Val1','Val2'], inplace=True)
df2.set_index(['Val1','Val2'], inplace=True)
df_final = df2.join(df1, how = 'left').reset_index()
Upvotes: 1
Views: 348
Reputation: 6333
In my experience, integer keys result in faster merge times. For this, you can map Val1 and Val2 to integers on both dataframes (df1 and df2) and then merge by Val1 and Val2.
I'm sure there's a more efficient way to map Val1 and Val2 to integers, but the purpose of this answer is to show that merging on integers is faster.
# Build value -> integer dictionaries from the categories found in df2
d1 = {v: k for k, v in enumerate(pd.Categorical(df2['Val1']).categories)}
d2 = {v: k for k, v in enumerate(pd.Categorical(df2['Val2']).categories)}
# Replace the key columns in both dataframes with their integer codes
# (assigning the result back avoids relying on inplace=True on a column selection;
# a key in df1 that never occurs in df2 would become NaN here)
df1['Val1'] = df1['Val1'].map(d1)
df2['Val1'] = df2['Val1'].map(d1)
df1['Val2'] = df1['Val2'].map(d2)
df2['Val2'] = df2['Val2'].map(d2)
# Merge by integer versions of Val1 and Val2
df2.merge(df1, on=['Val1','Val2'], how='left')
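As one possible shortcut for the integer mapping, the sketch below builds the codes with pd.Categorical and a shared category list instead of dictionaries. The helper name encode_keys is made up for illustration, and the frames are the small ones from the question; I have not measured this on 5 million rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Val1': [1.0, 1.0, 2.0, 2.0],
    'Val2': ['Red', 'Blue', 'Red', 'Blue'],
    'Item': ['Up', 'Down', 'Down', 'Up'],
})
df2 = pd.DataFrame({
    'Val1': [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
    'Val2': ['Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue'],
})

def encode_keys(small, big, cols):
    """Hypothetical helper: return copies of both frames with `cols`
    replaced by integer codes drawn from one shared category list."""
    small, big = small.copy(), big.copy()
    for col in cols:
        # Use the unique values of both frames so the codes agree
        cats = pd.concat([small[col], big[col]]).unique()
        small[col] = pd.Categorical(small[col], categories=cats).codes
        big[col] = pd.Categorical(big[col], categories=cats).codes
    return small, big

df1_int, df2_int = encode_keys(df1, df2, ['Val1', 'Val2'])

# Left merge on the integer-coded keys keeps every row of df2
merged = df2_int.merge(df1_int, on=['Val1', 'Val2'], how='left')
```

Because the categories come from both frames, every key gets a valid code, and the original frames are left untouched since the helper works on copies.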
Comparison of execution times:
import time
# Merge with original keys
start = time.time()
df2.merge(df1, on=['Val1','Val2'], how='left')
round(time.time() - start, 5)
0.00672
# Merge with integer keys
start = time.time()
df2.merge(df1, on=['Val1','Val2'])
round(time.time() - start, 5)
0.00485
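To repeat the comparison on something closer to the real workload, a sketch like the following could be used. The row count, the random data, and the helper time_merge are all assumptions for illustration; it makes no claim about which variant wins on your machine.

```python
import time

import numpy as np
import pandas as pd

n_rows = 200_000  # assumed size; raise it to approach the 5 million rows in the question
rng = np.random.default_rng(0)

df1 = pd.DataFrame({
    'Val1': [1.0, 1.0, 2.0, 2.0],
    'Val2': ['Red', 'Blue', 'Red', 'Blue'],
    'Item': ['Up', 'Down', 'Down', 'Up'],
})
# Synthetic stand-in for the large frame
df2 = pd.DataFrame({
    'Val1': rng.choice([1.0, 2.0], size=n_rows),
    'Val2': rng.choice(['Red', 'Blue'], size=n_rows),
})

# Integer-coded copies of both frames
codes1 = {1.0: 0, 2.0: 1}
codes2 = {'Red': 0, 'Blue': 1}
df1_int = df1.assign(Val1=df1['Val1'].map(codes1), Val2=df1['Val2'].map(codes2))
df2_int = df2.assign(Val1=df2['Val1'].map(codes1), Val2=df2['Val2'].map(codes2))

def time_merge(left, right, repeats=5):
    """Hypothetical helper: best wall-clock time over `repeats` left merges."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        left.merge(right, on=['Val1', 'Val2'], how='left')
        best = min(best, time.perf_counter() - start)
    return best

t_orig = time_merge(df2, df1)
t_int = time_merge(df2_int, df1_int)
print(f"original keys: {t_orig:.5f} s, integer keys: {t_int:.5f} s")
```

Taking the best of several runs with time.perf_counter reduces noise compared to a single time.time() measurement, and both merges use how='left' so they do the same work.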
Upvotes: 1