Reputation: 523
I have a large data set and I want to do the following task efficiently. Suppose we have two data frames. For each element in df2, I want to search df1, but only in rows where the first two letters match; among those, the word with the most tokens in common is chosen.
Let's see in an example:
df1:
name           first_two
common work    co
summer hot     su
apple          ap
colorful fall  co
support it     su
could comp     co
df2:
name               first_two
condition work it  co
common mistakes    co
could comp work    co
summer             su
appears            ap
Take the first row of df2 as an example (condition work it). I want to find the row in df1 that has the same first_two and the most tokens in common. The first_two of condition work it is co, so I search in df1 where first_two is co. The search is therefore among common work, colorful fall, and could comp. Since condition work it has one token in common with common work, it is selected.
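The overlap count in this step is just a set intersection over whitespace-separated tokens; a minimal sketch (function and variable names are my own):

```python
def common_tokens(a, b):
    # Number of whitespace-separated tokens the two strings share.
    return len(set(a.split()) & set(b.split()))

query = "condition work it"
candidates = ["common work", "colorful fall", "could comp"]

# "common work" shares the token "work"; the other candidates share none.
best = max(candidates, key=lambda c: common_tokens(query, c))
# best → "common work"
```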
output:
df2:
name               first_two  match
condition work it  co         common work
common mistakes    co         common work
could comp work    co         could comp
summer             su         summer hot
appears            ap         NaN
The last row is NaN since there is no common token between appears and apple.
I did the following:

df3 = (df1.groupby(['first_two'])
          .agg({'name': lambda x: ",".join(x)})
          .reset_index())
merge_ = df3.merge(df2, on='first_two', how='inner')
But now, for each name_y, I have to search in name_x. How do I find the element of name_x that has the most tokens in common with name_y?
Upvotes: 0
Views: 47
Reputation: 206
You have pretty much explained the most efficient method already. Use .str[:2] on the series and assign it to a new column in both dataframes.

Upvotes: 1
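For completeness, a sketch of that step, assuming the text column is called name:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["common work", "summer hot", "apple"]})

# .str[:2] slices the first two characters of each string in the series.
df1["first_two"] = df1["name"].str[:2]
# first_two → "co", "su", "ap"
```

With first_two present in both frames, the merge on that column shown in the question narrows the comparison to same-prefix rows only.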