Reputation: 4842
I have a df_1
like this:
A
apple, iphone, android
facebook, apple
macbook, laptop
firestick, hulu, netflix
android, laptop
laptop
And df_2
like this:
A B
apple 1
macbook 2
facebook 3
firestick 4
hulu 5
netflix 6
android 7
laptop 8
I am trying to extract a single word from A
column of df_1
that has the lowest value in column B
from df_2
like so:
A B_new
apple, iphone, android apple
facebook, apple apple
macbook, laptop macbook
hulu, netflix, firestick firesick
laptop, android android
laptop laptop
I assume I could sort each value of df_1
column A
based on the values of B
in df_2
. Or create a function that takes in a single A
value from df_1
and returns str
with the smallest number in B
from df_2
. But as the data is quite big I assume using apply
is not very efficient. Is there a neat Pandas way of doing such task?
Upvotes: 2
Views: 74
Reputation: 28669
@jezrael's solution is cleaner, and should be faster, since we are dealing with strings; the solution below is an alternative:
Iterate to get the individual entries:
values = [[value.strip() for value in entry.split(",")]
for entry in df1.A.__iter__()]
values
[['apple', 'iphone', 'android'],
['facebook', 'apple'],
['macbook', 'laptop'],
['firestick', 'hulu', 'netflix'],
['android', 'laptop'],
['laptop']]
Get the minimum, which in this case will be the first True:
values = [df2.loc[df2.A.isin(value), "B"].idxmin()
for value in values]
values
[0, 0, 1, 3, 6, 7]
Select the values and assign to the new column:
df1.loc[:, 'B_new'] = df2.iloc[values, 0]
A B_new
0 apple, iphone, android apple
1 facebook, apple apple
2 macbook, laptop macbook
3 firestick, hulu, netflix firestick
4 android, laptop android
5 laptop laptop
Upvotes: 1
Reputation: 2167
Not the fastest either, but I like this other way of thinking about this.
You could use the method str.get_dummies, perform column wise multiplications and take the idxmin :
ab_dict = df_2.set_index("A")["B"].to_dict()
df_1["B"] = df_1["A"].str.get_dummies(", ")
.apply(lambda c: c.replace(0, np.nan)*ab_dict.get(c.name, np.nan))
.idxmin(axis=1)
Upvotes: 0
Reputation: 862741
You can create dictionary and matching values if exist, then get maximal value else missing value:
d = df_2.set_index('A')['B'].to_dict()
def f(x):
d1 = {y:d[y] for y in x.split(', ') if y in d}
return min(d1, key=d1.get) if len(d1) > 1 else np.nan
Or:
import operator
def f(x):
d1 = {y:d[y] for y in x.split(', ') if y in d}
return min(d1.items(), key=operator.itemgetter(1))[0] if len(d1) > 1 else np.nan
df_1['new'] = df_1['A'].apply(f)
print (df_1)
A new
0 apple, iphone, android apple
1 facebook, apple apple
2 macbook, laptop macbook
3 firestick, hulu, netflix firestick
4 android, laptop android
5 laptop laptop
Upvotes: 2