Reputation: 1003
I have a dataframe that contains a bunch of people's text descriptions. Other than that, I also have 4 descriptions a,b,c,d. For each person's text description, I wish to compare them to each of the 4 descriptions by using cosine similarity and store these scores in the same dataframe in 4 new columns: a, b, c, d.
How can I do this in a panda way, without using for loops? I was thinking of using the apply function but I don't know how to reference to the 'text' column as well as the 4 descriptions a,b,c,d in the apply function.
Thank you very much for any help!!
What I have tried:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]
description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]
mystuff = [('person 1',person_one),
('person 2',person_two),
('person 3',person_three),
('person 4',person_four)
]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])
def trying(cell,jd):
vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
jd_vector = vectorizer.transform(jd)
person_vector = vectorizer.transform(cell['text'])
score = cosine_similarity(jd_vector,person_vector)
return score
df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))
This gives me an error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
The output should look something like this:
person text a b c d
0 person 1 [table, car, mouse] 0.3 0.2 0.5 0.7
1 person 2 [computer, card, can, mouse] 0.2 0.1 0.9 0.7
2 person 3 [chair, table, whiteboard, window, button] 0.3 0.5 0.1 0.4
3 person 4 [queen, king, joker, phone] 0.2 0.4 0.3 0.5
Upvotes: 0
Views: 1046
Reputation: 1003
I can't post comment yet, but to solve the error :
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
You need to pass the parameter like this :
df['a'] = df['a'].apply(trying, args=(description_a))
The first argument will be the column vector in your case, and the other arguments will then be taken in order from ther args list.
Hope this help.
Upvotes: 4
Reputation: 46
How about this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = ['table','car','mouse']
person_two = ['computer','card','can','mouse']
person_three = ['chair','table','whiteboard','window','button']
person_four = ['queen','king','joker','phone']
description_a = ['table','yellow','car','king']
description_b = ['bottle','whiteboard','queen']
description_c = ['chair','car','car','phone']
description_d = ['joker','blue','earphone','king']
descriptors = {
'a' : description_a,
'b' : description_d,
'c' : description_c,
'd' : description_d
}
mystuff = [('person 1',person_one),
('person 2',person_two),
('person 3',person_three),
('person 4',person_four)
]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
vocabulary_data =[
person_one,
person_two,
person_three,
person_four,
description_a,
description_b,
description_c,
description_d,
]
data = [set(sentence) for sentence in vocabulary_data]
vocabulary = set.union(*data)
cv = CountVectorizer(vocabulary=vocabulary)
def similarity(row, desc):
a = cosine_similarity(cv.fit_transform(row['text']).sum(axis=0), cv.fit_transform(desc).sum(axis=0))
return a.item()
for key, description in descriptors.items():
df[key] = df.apply(lambda x: similarity(x, description), axis=1)
I used one for loop, but only for filling different descriptions. The main "computation" is done by apply.
Upvotes: 0