Reputation: 31
I have a dataset with around 400k rows. I need to find the common words between the question1 and question2 columns. I am able to print the output with a zip and for loop; however, I would like to create a function that returns these values. Can you please help me?
for a, b in zip(df.question1, df.question2):
    str1 = set(a.lower().strip().split())
    str2 = set(b.lower().strip().split())
    word_common = len(str1 & str2)
    word_total = len(str1) + len(str2)
    word_share = round(word_common / word_total, 2)
    print(word_common, word_total, word_share)
This prints the output:
10 23 0.43
4 20 0.2
4 24 0.17
However, when I wrap this inside a function I get only one value (i.e. word_common), depending on where I place the return keyword. How can I store this output in a dataframe?
def find_common_words(df, strg1, strg2):
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        return word_common
Upvotes: 0
Views: 68
Reputation: 1572
When return runs, the function stops and the value is returned. So your function exits during the first iteration of the loop because of the return statement, and only the first value of word_common is returned. You should instead store your values in a list and return that list once the loop has finished.
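To illustrate the difference, here is a minimal sketch in plain Python (no pandas). The function names first_only and collect_all and the sample sentence pairs are made up for this illustration.

```python
def first_only(pairs):
    for a, b in pairs:
        common = len(set(a.lower().split()) & set(b.lower().split()))
        return common  # leaves the function during the first iteration


def collect_all(pairs):
    results = []
    for a, b in pairs:
        common = len(set(a.lower().split()) & set(b.lower().split()))
        results.append(common)  # keep one value per pair
    return results  # returned only after the loop has finished


# Hypothetical sample data
pairs = [
    ("How do I learn Python", "How can I learn Python fast"),
    ("What is pandas", "Is pandas a library"),
]

print(first_only(pairs))
print(collect_all(pairs))
```

The first function only ever sees the first pair, while the second returns one count per pair.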
Secondly, since you have a DataFrame, you can use its apply method to compute these values. It takes a function as input and, with axis=1, applies it to each row of the DataFrame.
In the following code, the value of word_common is stored in a new column of your DataFrame, named word_common:
def parse_one_row(row):
    a = row['question1']
    b = row['question2']
    str1 = set(a.lower().strip().split())
    str2 = set(b.lower().strip().split())
    word_common = len(str1 & str2)
    word_total = len(str1) + len(str2)
    word_share = round(word_common / word_total, 2)
    return (word_common, word_total, word_share)

# df.apply(..., axis=1) returns a Series of tuples; keep the first element of each.
# Series.apply has no axis parameter, so it is not passed to the second call.
df['word_common'] = df.apply(parse_one_row, axis=1).apply(lambda x: x[0])
See the official pandas documentation for DataFrame.apply for more details.
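If all three values are wanted as columns rather than just word_common, one possible approach is to turn the Series of tuples into its own DataFrame and join it back. This is only a sketch: the sample data, the column names in cols, and the choice to report a share of 0.0 when both questions are empty are assumptions, not part of the original code.

```python
import pandas as pd


def parse_one_row(row):
    str1 = set(row['question1'].lower().strip().split())
    str2 = set(row['question2'].lower().strip().split())
    word_common = len(str1 & str2)
    word_total = len(str1) + len(str2)
    # Assumption: report 0.0 instead of dividing by zero for two empty questions
    word_share = round(word_common / word_total, 2) if word_total else 0.0
    return (word_common, word_total, word_share)


# Hypothetical sample data
df = pd.DataFrame({
    'question1': ['How do I learn Python', 'What is pandas'],
    'question2': ['How can I learn Python fast', 'Is pandas a library'],
})

cols = ['word_common', 'word_total', 'word_share']
stats = df.apply(parse_one_row, axis=1)  # Series holding one tuple per row
stats_df = pd.DataFrame(stats.tolist(), columns=cols, index=df.index)
df = df.join(stats_df)  # aligned on the original index
print(df)
```

Passing index=df.index keeps the new columns aligned with the original rows even when the DataFrame does not use the default index.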
Upvotes: 1
Reputation: 881
Use this to return the values in a dataframe:
import pandas as pd

def find_common_words(df, strg1, strg2):
    stats = []  # one [common, total, share] entry per row
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        stats.append([word_common, word_total, word_share])
    # Return only after the loop has processed every row
    return pd.DataFrame(stats, columns=['Word Common', 'Word Total', 'Word Share'])
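A possible way to call this function and attach the result to the original data is sketched below. The sample frame and its index values are hypothetical. Note that the returned DataFrame gets a fresh 0..n-1 index, so if df uses a different index, the sketch copies it over before concatenating.

```python
import pandas as pd


def find_common_words(df, strg1, strg2):
    stats = []
    for a, b in zip(df[strg1], df[strg2]):
        str1 = set(a.lower().strip().split())
        str2 = set(b.lower().strip().split())
        word_common = len(str1 & str2)
        word_total = len(str1) + len(str2)
        word_share = round(word_common / word_total, 2)
        stats.append([word_common, word_total, word_share])
    return pd.DataFrame(stats, columns=['Word Common', 'Word Total', 'Word Share'])


# Hypothetical sample data with a non-default index
df = pd.DataFrame(
    {
        'question1': ['How do I learn Python', 'What is pandas'],
        'question2': ['How can I learn Python fast', 'Is pandas a library'],
    },
    index=[10, 20],
)

stats = find_common_words(df, 'question1', 'question2')
stats.index = df.index  # rows are in the same order, so reuse the original index
result = pd.concat([df, stats], axis=1)
print(result)
```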
Upvotes: 0