Chopin

Reputation: 214

Quicker method to map values - pandas

I have a small data frame (df1) with no more than 10 rows. It has two reference columns (Val1, Val2) and a third column (Item) that holds an important string. A second data frame (df2) has the same reference columns (Val1, Val2) but more than 5 million rows. I want to map Item from df1 onto df2 efficiently.

I've tried merge, and I've tried setting the index and using join. Dask did not speed up the process either. Is there another method I can use instead of merging/joining the two data frames?

import pandas as pd

df1 = pd.DataFrame({
    'Val1' : [1.0,1.0,2.0,2.0],
    'Val2' : ['Red','Blue','Red','Blue'],
    'Item' : ['Up','Down','Down','Up'],
    })

df2 = pd.DataFrame({
    'Val1' : [1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0],
    'Val2' : ['Red','Blue','Red','Blue','Red','Blue','Red','Blue'],
    })

df1.set_index(['Val1','Val2'], inplace=True)
df2.set_index(['Val1','Val2'], inplace=True)

df_final = df2.join(df1, how = 'left').reset_index()

Upvotes: 1

Views: 348

Answers (1)

Arturo Sbr

Reputation: 6333

In my experience, merging on integer keys is faster. You can map Val1 and Val2 to integers in both dataframes (df1 and df2) and then merge on Val1 and Val2.

There is probably a more efficient way to map Val1 and Val2 to integers; the point of this answer is only to show that merging on integers is faster.

# Assumes Val1 and Val2 are regular columns. If they were set as the index
# (as in the question), call reset_index() on both dataframes first.
# Build value -> integer dictionaries from the categories found in df2
d1 = {v: k for k, v in enumerate(df2['Val1'].astype('category').cat.categories)}
d2 = {v: k for k, v in enumerate(df2['Val2'].astype('category').cat.categories)}
# Replace the key columns in both dataframes with their integer codes
# (assigning the result back avoids chained inplace replace, which may not
# modify the dataframe in recent pandas versions)
df1['Val1'] = df1['Val1'].map(d1)
df2['Val1'] = df2['Val1'].map(d1)
df1['Val2'] = df1['Val2'].map(d2)
df2['Val2'] = df2['Val2'].map(d2)
# Merge on the integer versions of Val1 and Val2
df_final = df2.merge(df1, on=['Val1','Val2'], how='left')
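
The integer mapping above can also be done without hand-built dictionaries. Below is a minimal sketch, assuming both dataframes keep Val1 and Val2 as regular columns: each key column gets a shared CategoricalDtype, and .cat.codes then yields the same integer for the same value in both frames. The helper name encode_keys is hypothetical, and I have not measured whether this is faster on 5 million rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Val1': [1.0, 1.0, 2.0, 2.0],
    'Val2': ['Red', 'Blue', 'Red', 'Blue'],
    'Item': ['Up', 'Down', 'Down', 'Up'],
})
df2 = pd.DataFrame({
    'Val1': [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
    'Val2': ['Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue'],
})


def encode_keys(small, large, keys):
    """Hypothetical helper: return copies of both frames with integer-coded keys."""
    small, large = small.copy(), large.copy()
    for key in keys:
        # One dtype per key column, shared by both frames, so codes line up
        values = pd.concat([small[key], large[key]]).dropna().unique()
        shared = pd.CategoricalDtype(values)
        small[key] = small[key].astype(shared).cat.codes
        large[key] = large[key].astype(shared).cat.codes
    return small, large


keys = ['Val1', 'Val2']
df1_int, df2_int = encode_keys(df1, df2, keys)

# Merge on the integer codes; a left merge keeps the row order of df2_int
df_final = df2_int.merge(df1_int, on=keys, how='left')
print(df_final['Item'].tolist())
```

Note that the key columns in df_final hold codes, not the original values. If you need the originals back, one option is to assign df_final['Item'] onto the untouched df2, since the left merge keeps df2's row order here (df1 has one row per key pair).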

Comparison of execution times (the first merge is run before the keys are replaced with integers):

import time

# Merge with original keys
start = time.time()
df2.merge(df1, on=['Val1','Val2'], how='left')
round(time.time() - start, 5)

0.00672

# Merge with integer keys
start = time.time()
df2.merge(df1, on=['Val1','Val2'], how='left')
round(time.time() - start, 5)

0.00485
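
A single time.time() measurement on 8 rows is noisy, so here is a rough sketch of a more repeatable comparison using timeit on a larger synthetic df2. The row count, the random seed and the hard-coded dictionaries are my assumptions for illustration, not data from the question, and I'm not quoting any timings because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # assumed seed, only for reproducibility
n_rows = 200_000                 # assumed size; raise it to approach 5 million

df1 = pd.DataFrame({
    'Val1': [1.0, 1.0, 2.0, 2.0],
    'Val2': ['Red', 'Blue', 'Red', 'Blue'],
    'Item': ['Up', 'Down', 'Down', 'Up'],
})
# Synthetic large frame drawn from the same key values as df1
df2 = pd.DataFrame({
    'Val1': rng.choice([1.0, 2.0], size=n_rows),
    'Val2': rng.choice(['Red', 'Blue'], size=n_rows),
})

# Integer-coded copies of both frames (mappings written out by hand here)
d1 = {1.0: 0, 2.0: 1}
d2 = {'Blue': 0, 'Red': 1}
df1_int = df1.assign(Val1=df1['Val1'].map(d1), Val2=df1['Val2'].map(d2))
df2_int = df2.assign(Val1=df2['Val1'].map(d1), Val2=df2['Val2'].map(d2))


def merge_original():
    return df2.merge(df1, on=['Val1', 'Val2'], how='left')


def merge_integer():
    return df2_int.merge(df1_int, on=['Val1', 'Val2'], how='left')


# Take the best of several runs to reduce noise from other processes
t_original = min(timeit.repeat(merge_original, number=1, repeat=5))
t_integer = min(timeit.repeat(merge_integer, number=1, repeat=5))
print(f"original keys: {t_original:.5f} s")
print(f"integer keys:  {t_integer:.5f} s")
```

The time spent building the integer columns is not included in either number, so add it in if you only need to do the merge once.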

Upvotes: 1
