aloneonthe_edge

Reputation: 31

Compare data from 2 columns and return results in a different dataframe

I have a dataset which has around 400k rows. I need to find common words between question1 and question2 columns. I am able to print the output with a zip and for loop, however I would like to create a function to return these values. Can you please help me?

for a, b in zip(df.question1, df.question2):
    str1 = (set(a.lower().strip().split()))
    str2 = (set(b.lower().strip().split()))
    word_common =  (len(str1 & str2))
    word_total = len(str1) + len(str2)
    word_share = round(word_common/word_total,2)
    print(word_common,word_total,word_share)

This prints the output:

10 23 0.43
4 20 0.2
4 24 0.17

However, when I wrap this inside a function, I get only one value (i.e. word_common), depending on where I place the return keyword. How can I store this output in a dataframe?

def find_common_words(df,strg1,strg2):
    for a, b in zip(df[strg1], df[strg2]):
        str1 = (set(a.lower().strip().split()))
        str2 = (set(b.lower().strip().split()))
        word_common =  (len(str1 & str2))
        word_total = len(str1) + len(str2)
        word_share = round(word_common/word_total,2)
        return word_common

Upvotes: 0

Views: 68

Answers (2)

Catalina Chircu

Reputation: 1572

When a return statement runs, the function stops and hands back the value. Your return is inside the loop, so the function exits after the first iteration and only the first value of word_common is returned. You should instead store your values in a list and return that list after the loop.
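A minimal sketch of the difference, using two hypothetical toy functions (first_only and collect_all are made up for illustration and are not part of the question's code):

```python
def first_only(values):
    # return inside the loop exits the function on the first iteration
    for v in values:
        return v * 2


def collect_all(values):
    # Accumulate every result, then return once the loop has finished
    results = []
    for v in values:
        results.append(v * 2)
    return results


print(first_only([1, 2, 3]))   # prints 2
print(collect_all([1, 2, 3]))  # prints [2, 4, 6]
```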

Secondly, since you have a DataFrame, you can use its apply method to produce the values. It takes a function as input and, with axis=1, applies it to each row of the DataFrame.

In the following code, the value of word_common is stored in a new column of your DataFrame, named word_common:

def parse_one_row(row):
    a = row['question1']
    b = row['question2'] 
    str1 = (set(a.lower().strip().split()))
    str2 = (set(b.lower().strip().split()))
    word_common =  (len(str1 & str2))
    word_total = len(str1) + len(str2)
    word_share = round(word_common/word_total,2)
    return (word_common, word_total, word_share)


# df.apply returns a Series of tuples; Series.apply takes no axis argument
df['word_common'] = df.apply(parse_one_row, axis=1).apply(lambda x: x[0])

See the official pandas documentation for DataFrame.apply.
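If all three values are wanted as columns, one possible approach is to pass result_type='expand' to DataFrame.apply. The sketch below assumes a small made-up DataFrame with question1 and question2 columns, and it assumes that no row has two empty questions (that would make word_total zero and raise a ZeroDivisionError):

```python
import pandas as pd


def parse_one_row(row):
    # Unique lowercase words in each question
    str1 = set(row['question1'].lower().strip().split())
    str2 = set(row['question2'].lower().strip().split())
    word_common = len(str1 & str2)
    word_total = len(str1) + len(str2)
    word_share = round(word_common / word_total, 2)
    return (word_common, word_total, word_share)


# Made-up sample data, only for illustration
df = pd.DataFrame({
    'question1': ['What is the capital of France', 'How do I learn Python'],
    'question2': ['What is the capital of Spain', 'How can I learn Java quickly'],
})

# result_type='expand' turns each returned tuple into one row of a DataFrame
stats = df.apply(parse_one_row, axis=1, result_type='expand')
stats.columns = ['word_common', 'word_total', 'word_share']

# join aligns on the index, which stats inherits from df
df = df.join(stats)
print(df[['word_common', 'word_total', 'word_share']])
```

Because each expanded row mixes integers and a float, the count columns may come back as floats; cast them with astype(int) if integer columns matter.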

Upvotes: 1

Ji Wei

Reputation: 881

Use this to return the values in a dataframe:

import pandas as pd

def find_common_words(df, strg1, strg2):
    stats = []
    for a, b in zip(df[strg1], df[strg2]):
        # Unique lowercase words in each question
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        # Collect one row of results per input row instead of returning early
        stats.append([word_common, word_total, word_share])
    return pd.DataFrame(stats, columns=['Word Common', 'Word Total', 'Word Share'])
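A usage sketch for this function, assuming a small made-up DataFrame. The sample data and the names stats and combined are illustrative only. Since the returned frame gets a fresh 0..n-1 index, the example copies the original index before concatenating so the rows still line up if df was filtered or re-indexed:

```python
import pandas as pd


def find_common_words(df, strg1, strg2):
    stats = []
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        stats.append([word_common, word_total, round(word_common / word_total, 2)])
    return pd.DataFrame(stats, columns=['Word Common', 'Word Total', 'Word Share'])


# Made-up sample data, only for illustration
df = pd.DataFrame({
    'question1': ['What is the capital of France', 'How do I learn Python'],
    'question2': ['What is the capital of Spain', 'How can I learn Java quickly'],
})

stats = find_common_words(df, 'question1', 'question2')

# Reuse the original index so concat pairs each result with its source row
stats.index = df.index
combined = pd.concat([df, stats], axis=1)
print(combined)
```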

Upvotes: 0
