Applying function on pandas column using information from another column

Question

I have a dataframe containing text descriptions of several people. I also have 4 reference descriptions: a, b, c, d. I want to compare each person's text description to each of the 4 descriptions using cosine similarity and store the scores in the same dataframe in 4 new columns: a, b, c, d.

How can I do this the pandas way, without using for loops? I was thinking of using the apply function, but I don't know how to reference both the 'text' column and the 4 descriptions a, b, c, d inside apply.

Thank you very much for any help!!

What I have tried:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]

description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]

mystuff = [('person 1',person_one),
           ('person 2',person_two),
           ('person 3',person_three),
           ('person 4',person_four)
           ]

labels = ['person','text']

df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])

def trying(cell,jd):
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
    jd_vector = vectorizer.transform(jd)
    person_vector = vectorizer.transform(cell['text'])
    score = cosine_similarity(jd_vector,person_vector)

    return score


df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))

This gives me an error:

df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'

The output should look something like this:

     person                                        text   a   b   c   d
0  person 1                         [table, car, mouse] 0.3 0.2 0.5 0.7
1  person 2                [computer, card, can, mouse] 0.2 0.1 0.9 0.7
2  person 3  [chair, table, whiteboard, window, button] 0.3 0.5 0.1 0.4
3  person 4                 [queen, king, joker, phone] 0.2 0.4 0.3 0.5

mensik · Accepted Answer

How about this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


person_one = ['table','car','mouse']
person_two = ['computer','card','can','mouse']
person_three = ['chair','table','whiteboard','window','button']
person_four = ['queen','king','joker','phone']

description_a = ['table','yellow','car','king']
description_b = ['bottle','whiteboard','queen']
description_c = ['chair','car','car','phone']
description_d = ['joker','blue','earphone','king']

descriptors = {
    'a' : description_a,
    'b' : description_b,
    'c' : description_c,
    'd' : description_d
}

mystuff = [('person 1',person_one),
           ('person 2',person_two),
           ('person 3',person_three),
           ('person 4',person_four)
           ]

labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)

vocabulary_data =[
    person_one,
    person_two,
    person_three,
    person_four,
    description_a,
    description_b,
    description_c,
    description_d,
]

data = [set(sentence) for sentence in vocabulary_data]
vocabulary = set.union(*data)
cv = CountVectorizer(vocabulary=vocabulary)


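def similarity(row, desc):
    # Each word counts as its own document, so sum the rows into one count vector.
    # toarray() yields a plain ndarray; recent scikit-learn versions reject np.matrix.
    person_vec = cv.transform(row['text']).toarray().sum(axis=0, keepdims=True)
    desc_vec = cv.transform(desc).toarray().sum(axis=0, keepdims=True)
    return cosine_similarity(person_vec, desc_vec).item()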

for key, description in descriptors.items():
    df[key] = df.apply(lambda x: similarity(x, description), axis=1)

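I used one for loop, but only to iterate over the descriptions. The main computation is done by apply. Your original call failed because trying(description_a) is executed immediately with a single argument, instead of a function being handed to apply. Passing a lambda to df.apply with axis=1 gives the function the whole row, so it can read row['text'] and still receive the description as a second argument.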
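
If you want to avoid apply as well, here is a minimal sketch of a fully vectorized variant. It is not part of the code above, only an alternative under a few assumptions: the word lists are joined into strings so that CountVectorizer sees one document per person, the vocabulary is fitted on people and descriptions together, and the names people, person_docs, desc_docs and score_df are made up for illustration. cosine_similarity is then called once on the two count matrices, and each column of the result holds the scores for one description.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same toy data as above, stored as lists of words (names are illustrative).
people = {
    'person 1': ['table', 'car', 'mouse'],
    'person 2': ['computer', 'card', 'can', 'mouse'],
    'person 3': ['chair', 'table', 'whiteboard', 'window', 'button'],
    'person 4': ['queen', 'king', 'joker', 'phone'],
}
descriptors = {
    'a': ['table', 'yellow', 'car', 'king'],
    'b': ['bottle', 'whiteboard', 'queen'],
    'c': ['chair', 'car', 'car', 'phone'],
    'd': ['joker', 'blue', 'earphone', 'king'],
}

df = pd.DataFrame({'person': list(people), 'text': list(people.values())})

# Join each word list into one string: one document per person / description.
person_docs = df['text'].str.join(' ')
desc_docs = [' '.join(words) for words in descriptors.values()]

# Fit a single shared vocabulary over people and descriptions.
cv = CountVectorizer().fit(list(person_docs) + desc_docs)

# scores[i, j] is the cosine similarity between person i and description j.
scores = cosine_similarity(cv.transform(person_docs), cv.transform(desc_docs))

# Attach the scores as new columns a, b, c, d, aligned on the existing index.
score_df = pd.DataFrame(scores, index=df.index, columns=list(descriptors))
df = df.join(score_df)
print(df)
```

For a handful of rows the difference does not matter, but with many people this does a single matrix computation instead of one cosine_similarity call per row and description.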
