muazfaiz

Reputation: 5031

Fastest way of searching for matching strings in two large dataframes using pandas (Python)

I have two large dataframes, df1 (100K rows) and df2 (600K rows), which look like the following:

# df1
                                      name   price    brand  model 
0       CANON CAMERA 20 FS36dINFS MEGAPIXEL  9900.0   CANON  FS36dINFS     
1               SONY HD CAMERA 25 MEGAPIXEL  8900.0    SONY         
2       LG 55" 4K UHD LED Smart TV 55UJ635V  5890.0     LG   55UJ635V       
3      Sony 65" LED Smart TV KD-65XD8505BAE  4790.0    SONY  KD-65XD8505BAE     
4       LG 49" 4K UHD LED Smart TV 49UJ651V  4390.0      LG  49UJ651V     

#df2

                                  name       store      price
0     LG 49" 4K UHD LED Smart TV 49UJ651V     storeA   4790.0
1             SONY HD CAMERA 25 MEGAPIXEL     storeA   12.90
2  Samsung 32" LED Smart TV UE-32J4505XXE     storeB    1.30

I want to check whether the brand and model from df1 appear in df2['name'], and do something when they do. Currently I am using a naive approach, iterating over both dataframes like this:

datalist = []
for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if row1['brand'] in row2['name'] and row1['model'] in row2['name']:
            datalist.append([row1['model'], row1['brand'], row1['name'], row1['price'],
                             row2['name'], row2['price'], row2['store']])

But this is taking a lot of time because both dataframes are big. I have read that sets are faster, but the way I am iterating with iterrows, I can't convert to sets without losing the row positions. Is there any faster way to do this?

Upvotes: 0

Views: 911

Answers (1)

unutbu

Reputation: 880339

If there is a lot of repetition in df1['brand'] and df1['model'], then you could possibly improve performance by creating regex patterns for the brands and models:

brands = '({})'.format('|'.join(df1['brand'].dropna().unique()))
# '(CANON|SONY|LG)'
models = '({})'.format('|'.join(df1['model'].dropna().unique()))
# '(FS36dINFS|55UJ635V|KD-65XD8505BAE|49UJ651V)'
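These inline patterns work for the sample data, but brand and model strings can contain regex metacharacters (a hypothetical model like "X-100+" would break the alternation), so it is safer to escape each keyword with re.escape before joining, as the full script below does. A minimal sketch:

```python
import re

# model list taken from the sample df1 in the question
models = ['FS36dINFS', 'KD-65XD8505BAE', '49UJ651V']
# re.escape neutralizes any regex metacharacters inside each keyword
pat = '({})'.format('|'.join(re.escape(m) for m in models))

m = re.search(pat, 'Sony 65" LED Smart TV KD-65XD8505BAE')
print(m.group(1))  # KD-65XD8505BAE
```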

Then you could use the str.extract method to find brand and model strings from df2['name']:

df2['brand'] = df2['name'].str.extract(brands, expand=False)
df2['model'] = df2['name'].str.extract(models, expand=False)
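On the sample df2, the extraction tags each row with the first recognized brand, and NaN where nothing matches; a minimal sketch using the data from the question:

```python
import pandas as pd

df2 = pd.DataFrame({'name': ['LG 49" 4K UHD LED Smart TV 49UJ651V',
                             'SONY HD CAMERA 25 MEGAPIXEL',
                             'Samsung 32" LED Smart TV UE-32J4505XXE']})
brands = '(CANON|SONY|LG)'
# str.extract returns the first group match per row, NaN where the pattern fails
df2['brand'] = df2['name'].str.extract(brands, expand=False)
print(df2['brand'].tolist())  # ['LG', 'SONY', nan]
```

The Samsung row gets NaN because no brand from df1 appears in its name; the dropna(subset=...) calls later remove such rows before the merge.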

Then you could obtain the desired data in the form of a DataFrame by performing an inner merge on those two columns (dropping rows where either key is missing):

bm = ['brand', 'model']
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')

Putting it all together:

import re
import sys
import pandas as pd
pd.options.display.width = sys.maxsize

df1 = pd.DataFrame({'brand': ['CANON', 'SONY', 'LG', 'SONY', 'LG'], 'model': ['FS36dINFS', None, '55UJ635V', 'KD-65XD8505BAE', '49UJ651V'], 'name': ['CANON CAMERA 20 FS36dINFS MEGAPIXEL', 'SONY HD CAMERA 25 MEGAPIXEL', 'LG 55" 4K UHD LED Smart TV 55UJ635V', 'Sony 65" LED Smart TV KD-65XD8505BAE', 'LG 49" 4K UHD LED Smart TV 49UJ651V'], 'price': [9900.0, 8900.0, 5890.0, 4790.0, 4390.0]})

df2 = pd.DataFrame({'name': ['LG 49" 4K UHD LED Smart TV 49UJ651V', 'SONY HD CAMERA 25 MEGAPIXEL', 'Samsung 32" LED Smart TV UE-32J4505XXE'], 'price': [4790.0, 12.9, 1.3], 'store': ['storeA', 'storeA', 'storeB']})

bm = ['brand','model']
for col in bm:
    # escape regex metacharacters so strings like "KD-65XD8505BAE" match literally
    keywords = [re.escape(item) for item in df1[col].dropna().unique()]
    pat = '({})'.format('|'.join(keywords))
    # tag each df2 row with the first matching brand/model (NaN if none)
    df2[col] = df2['name'].str.extract(pat, expand=False)
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')
print(result)

yields

  brand     model                               name_x  price_x                               name_y  price_y   store
0    LG  49UJ651V  LG 49" 4K UHD LED Smart TV 49UJ651V   4390.0  LG 49" 4K UHD LED Smart TV 49UJ651V   4790.0  storeA
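One caveat: str.extract is case-sensitive, so a listing written as "Sony 65\"..." in df2 would not match the pattern (CANON|SONY|LG). If mixed-case names are possible, you could pass flags=re.IGNORECASE and normalize the extracted value with .str.upper() so it agrees with df1's keys in the merge. A sketch, using made-up mixed-case names:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'brand': ['CANON', 'SONY', 'LG'],
                    'model': ['FS36dINFS', 'KD-65XD8505BAE', '49UJ651V']})
df2 = pd.DataFrame({'name': ['Sony 65" LED Smart TV KD-65XD8505BAE',
                             'lg 49" 4K UHD LED Smart TV 49UJ651V']})

for col in ['brand', 'model']:
    keywords = [re.escape(k) for k in df1[col].dropna().unique()]
    pat = '({})'.format('|'.join(keywords))
    # flags=re.IGNORECASE matches regardless of case; .str.upper() normalizes
    # the extracted value so the merge keys line up with df1
    df2[col] = df2['name'].str.extract(pat, flags=re.IGNORECASE,
                                       expand=False).str.upper()

print(df2[['brand', 'model']].values.tolist())
# [['SONY', 'KD-65XD8505BAE'], ['LG', '49UJ651V']]
```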

Upvotes: 2
