zlatan
zlatan

Reputation: 49

How to do Keyword matching across different dataframes in Pandas?

I have 2 dataframes across which I need to map the keywords. The input data(df1) looks like this:

    keyword            subtopic     
    post office        Brand        
    uspshelp uspshelp  Help         
    package delivery   Shipping     
    fed ex             Brand        
    ups fedex          Brand        
    delivery done      Shipping     
    united states      location     
    rt ups             retweet      

This is the other dataframe (df2) which is to be used for keyword matching:

Key     Media_type  cleaned_text
910040  facebook    will take post office
409535  twitter     need help with upshelp upshelp
218658  facebook    there no section post office alabama ups fedex
218658  facebook    there no section post office alabama ups fedex
518903  twitter     cant wait see exactly ups fedex truck package
2423281 twitter     fed ex messed seedless
763587  twitter     crazy package delivery rammed car
827572  twitter     formatting idead delivery done
2404106 facebook    supoused mexico united states america
1077739 twitter     rt ups

I want to map the 'keyword' column in df1 to the 'cleaned_text' column in df2 based on few conditions:

  1. One row in 'keyword' can be mapped to more than one row in 'cleaned_text' (One to many relationship)
  2. It should select the whole keyword together and not just individual words.
  3. If a 'keyword' matches to more than one row in 'cleaned_Text' it should create new records in the output dataframe(df3)

This is how the output dataframe(df3) should look like:

Key     Media_type  cleaned_text                                    keyword               subtopic  
910040  facebook    will take post office                           post office           Brand 
409535  twitter     need help with upshelp upshelp                  uspshelp uspshelp     Help  
218658  facebook    there no section post office alabama ups fedex  post office           Brand 
218658  facebook    there no section post office alabama ups fedex  ups fedex             Brand 
518903  twitter     cant wait see exactly ups fedex truck package   ups fedex             Brand 
2423281 twitter     fed ex messed seedless                          fed ex messed         Brand 
763587  twitter     crazy package delivery rammed car               package delivery      Shipping  
827572  twitter     formatting idead delivery done                  delivery done         Shipping  
2404106 facebook    supoused mexico united states america           united states america location  
1077739 twitter     rt ups                                          rt ups                retweet               

Upvotes: 0

Views: 218

Answers (1)

chatax
chatax

Reputation: 998

How about converting your df1 into a dictionary? And then loop through your df2 and search for matches. It is maybe not the most efficient way, but it is very readable

keyword_dict = {row.keyword: row.subtopic for row in df1.itertuples()}
df3_data = []
for row in df2.itertuples():
    text = row.cleaned_text
    for keyword in keyword_dict:
        if keyword in text:
            df3_row = [row.Key, row.Media_type, row.cleaned_text, keyword, keyword_dict[keyword]]
            df3_data.append(df3_row)

df3_columns = list(df2.columns) + list(df1.columns)
df3 = pd.DataFrame(df3_data, columns=df3_columns)

Upvotes: 1

Related Questions