Amoroso

Reputation: 1003

Applying function on pandas column using information from another column

I have a dataframe that contains text descriptions for a number of people. I also have 4 reference descriptions: a, b, c, d. I want to compare each person's text description to each of the 4 descriptions using cosine similarity and store the scores in the same dataframe in 4 new columns: a, b, c, d.

How can I do this the pandas way, without explicit for loops? I was thinking of using the apply function, but I don't know how to reference both the 'text' column and the 4 descriptions a, b, c, d inside it.

Thank you very much for any help!!

What I have tried:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]

description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]

mystuff = [('person 1',person_one),
           ('person 2',person_two),
           ('person 3',person_three),
           ('person 4',person_four)
           ]

labels = ['person','text']

df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])

def trying(cell,jd):
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
    jd_vector = vectorizer.transform(jd)
    person_vector = vectorizer.transform(cell['text'])
    score = cosine_similarity(jd_vector,person_vector)

    return score


df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))

This gives me an error:

df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'

The output should look something like this:

     person                                        text   a   b   c   d
0  person 1                         [table, car, mouse] 0.3 0.2 0.5 0.7
1  person 2                [computer, card, can, mouse] 0.2 0.1 0.9 0.7
2  person 3  [chair, table, whiteboard, window, button] 0.3 0.5 0.1 0.4
3  person 4                 [queen, king, joker, phone] 0.2 0.4 0.3 0.5

Upvotes: 0

Views: 1046

Answers (2)

Tbaki

Reputation: 1003

I can't post a comment yet, but to solve the error:

df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'

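You need to pass the extra parameter through args, which must be a tuple (note the trailing comma). Since trying reads cell['text'], it also has to be applied to the whole dataframe row by row (axis=1) rather than to column 'a', whose cells are still empty:

df['a'] = df.apply(trying, axis=1, args=(description_a,))

The first argument will be the row in your case, and the other arguments will then be taken in order from the args tuple.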
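
As a minimal sketch of how args behaves with a row-wise apply: the overlap function and the toy data below are made up for illustration and are not the cosine-similarity code from the question.

```python
import pandas as pd

# Toy frame: each 'text' cell holds a list of words (illustrative data only).
df = pd.DataFrame({
    'person': ['person 1', 'person 2'],
    'text': [['table', 'car', 'mouse'], ['queen', 'king', 'phone']],
})

description_a = ['table', 'yellow', 'car', 'king']


def overlap(row, description):
    """Hypothetical scorer: count distinct words shared by the row's text and the description."""
    return len(set(row['text']) & set(description))


# axis=1 passes each row as the first argument; args must be a tuple,
# hence the trailing comma in (description_a,).
df['a'] = df.apply(overlap, axis=1, args=(description_a,))
print(df)
```

Any function with the signature (row, extra_argument) can be plugged in the same way, including one that returns a cosine similarity score.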

Hope this helps.

Upvotes: 4

mensik

Reputation: 46

How about this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


person_one = ['table','car','mouse']
person_two = ['computer','card','can','mouse']
person_three = ['chair','table','whiteboard','window','button']
person_four = ['queen','king','joker','phone']

description_a = ['table','yellow','car','king']
description_b = ['bottle','whiteboard','queen']
description_c = ['chair','car','car','phone']
description_d = ['joker','blue','earphone','king']

descriptors = {
    'a' : description_a,
    'b' : description_b,
    'c' : description_c,
    'd' : description_d
}

mystuff = [('person 1',person_one),
           ('person 2',person_two),
           ('person 3',person_three),
           ('person 4',person_four)
           ]

labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)

vocabulary_data =[
    person_one,
    person_two,
    person_three,
    person_four,
    description_a,
    description_b,
    description_c,
    description_d,
]

data = [set(sentence) for sentence in vocabulary_data]
vocabulary = set.union(*data)
cv = CountVectorizer(vocabulary=vocabulary)


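def similarity(row, desc):
    # Join each word list into a single document so that transform returns one
    # count vector per side; the fixed vocabulary means no refitting is needed.
    person_vec = cv.transform([' '.join(row['text'])])
    desc_vec = cv.transform([' '.join(desc)])
    return cosine_similarity(person_vec, desc_vec).item()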

for key, description in descriptors.items():
    df[key] = df.apply(lambda x: similarity(x, description), axis=1)

I used one for loop, but only to iterate over the different descriptions. The main computation is done by apply.
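
If the remaining loop and the per-row apply are also unwanted, one possible variation is to vectorize everything once and let cosine_similarity produce the whole people-by-descriptions matrix in a single call. This is only a sketch under the assumption that plain word counts over a shared vocabulary are what is wanted; the people and descriptions dictionaries are just a repackaging of the sample data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

people = {
    'person 1': ['table', 'car', 'mouse'],
    'person 2': ['computer', 'card', 'can', 'mouse'],
    'person 3': ['chair', 'table', 'whiteboard', 'window', 'button'],
    'person 4': ['queen', 'king', 'joker', 'phone'],
}
descriptions = {
    'a': ['table', 'yellow', 'car', 'king'],
    'b': ['bottle', 'whiteboard', 'queen'],
    'c': ['chair', 'car', 'car', 'phone'],
    'd': ['joker', 'blue', 'earphone', 'king'],
}

df = pd.DataFrame({'person': list(people), 'text': list(people.values())})

# Join each word list into one string so CountVectorizer sees one document per entry.
person_docs = [' '.join(words) for words in df['text']]
desc_docs = [' '.join(words) for words in descriptions.values()]

# Learn the vocabulary once from every document, then reuse it for both sides.
cv = CountVectorizer().fit(person_docs + desc_docs)
person_vectors = cv.transform(person_docs)  # one row per person
desc_vectors = cv.transform(desc_docs)      # one row per description

# A single call yields the full people-by-descriptions score matrix.
scores = cosine_similarity(person_vectors, desc_vectors)
df = df.join(pd.DataFrame(scores, columns=list(descriptions), index=df.index))
print(df)
```

Each column a, b, c, d then holds the score of every person against that description, with no Python-level loop over rows.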

Upvotes: 0
