user15649753
user15649753

Reputation: 523

how search on a specific rows of a dataframe to find a match in a second dataframe?

I have a large data set and I wanna do the following task in an efficient way. suppose we have 2 data frames. for each element in df2 I wanna search in the first data set df1 only in row where the first 2 letters are in common then the word with the most common token is choosen. Let's see in an example:

df1: 

common work   co
summer hot    su
apple         ap
colorful fall  co
support it     su
could comp     co

df2:

condition work it  co
common mistakes    co
could comp work    co
summer             su

Take the first row of df2 as an example (condition work it). I wanna find a row in df1 where they have the same first_two and have the most common token.

The first_two of condition work it is co. so I wanna search in df1 where first_two is co. So the search is done among: common work, colorful fall, could comp since condition work it has 1 common token with common work it is selected.

output:

df2:
name               first_two      match
condition work it  co              `common work`     
common mistakes    co              `common work`
could comp work    co              `could comp` 
summer             su              `summer hot'
appears            ap                  Nane

The last row is Nane since there is no common word between appears and apple

I did following:

df3=(df1.groupby(['first_two'])
      .agg({'name': lambda x: ",".join(x)})
      .reset_index())
merge_=df3.merge(df2, on='first_two',how='inner')

But now I have to search in name_x for each name_y. how to find an element of name_x whose has the most common token with name_y?

Upvotes: 0

Views: 47

Answers (1)

Bigya Pradhan
Bigya Pradhan

Reputation: 206

You have pretty much explained the most efficient method already.

  1. Extract first 2 letters using .str.[:2] for the series and assign it to new columns in both the dataframe.
  2. Extract unique values of 2 letter column from df2.
  3. Inner join the result from #2 on to df1.
  4. Perform a group by count on the result of #3 and sort descending based on the count and drop duplicates to get the most repeated item for the 2 letter column.
  5. Left join result of #4 onto df2.

Upvotes: 1

Related Questions