Compare data from 2 columns and return results in a different dataframe

Question

I have a dataset with around 400k rows. I need to find the common words between the question1 and question2 columns. I can print the output with zip and a for loop, but I would like to create a function that returns these values. Can you please help me?

for a, b in zip(df.question1, df.question2):
    str1 = (set(a.lower().strip().split()))
    str2 = (set(b.lower().strip().split()))
    word_common =  (len(str1 & str2))
    word_total = len(str1) + len(str2)
    word_share = round(word_common/word_total,2)
    print(word_common,word_total,word_share)

This prints the output:

10 23 0.43
4 20 0.2
4 24 0.17

However, when I wrap this inside a function, I get only one value (i.e. word_common), depending on where I place the return keyword. How can I store this output in a dataframe?

def find_common_words(df,strg1,strg2):
    for a, b in zip(df[strg1], df[strg2]):
        str1 = (set(a.lower().strip().split()))
        str2 = (set(b.lower().strip().split()))
        word_common =  (len(str1 & str2))
        word_total = len(str1) + len(str2)
        word_share = round(word_common/word_total,2)
        return word_common

Ji Wei · Accepted Answer

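A return statement inside the loop exits the function on the first iteration, so only one value comes back. Collect the per-row results in a list and build the dataframe after the loop: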

import pandas as pd

def find_common_words(df, strg1, strg2):
    stats = []  # one [common, total, share] entry per row
    for a, b in zip(df[strg1], df[strg2]):
        # Unique lowercase words in each question
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        stats.append([word_common, word_total, word_share])
    # Build the dataframe once, after the loop has processed every row
    return pd.DataFrame(stats, columns=['Word Common', 'Word Total', 'Word Share'])
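As a rough usage sketch, the function can be called on a small dataframe and the result joined back onto the original rows. The two sample questions below are made up for illustration and are not from the asker's dataset. The name `combined` and the index alignment step are also assumptions about how you might want to use the result.

```python
import pandas as pd


def find_common_words(df, strg1, strg2):
    stats = []
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        stats.append([word_common, word_total, word_share])
    return pd.DataFrame(stats, columns=['Word Common', 'Word Total', 'Word Share'])


# Hypothetical sample data, only to show the shape of the output
df = pd.DataFrame({
    'question1': ['What is the capital of France', 'How do I learn Python'],
    'question2': ['What is the capital city of France', 'How can I learn Java quickly'],
})

result = find_common_words(df, 'question1', 'question2')

# Align the row labels before joining, in case df does not use a default index
result.index = df.index
combined = df.join(result)
print(combined)
```

Note that a row where both questions are empty strings would make word_total zero and raise a ZeroDivisionError, so such rows may need to be filtered out or handled first.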
