Reputation: 103
I wrote the following code to compute the Jaccard similarity between every pair of questions in a CSV file:
import pandas as pd
import csv

df = pd.read_csv('data.csv', usecols=[0],
                 names=['Question'],
                 encoding='utf-8')

out = []
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out.append({'Question': q,
                    'Result': (float(len(c)) / (len(a) + len(b) - len(c)))})

new_df = pd.DataFrame(out, columns=['Question', 'Result'])
new_df.to_csv('output.csv', index=False, encoding='utf-8')
The output file looks like this:

Question           Result
The sky is blue    1.0
The ocean is blue  0.6
The sky is blue    0.6
The ocean is blue  1.0
The results are correct, but I would like to change the CSV output so that it shows them as a matrix, like this:

Question           The sky is blue  The ocean is blue
The sky is blue    1.0              0.6
The ocean is blue  0.6              1.0

I tried changing the code to use writerows, but I seem to be missing something. Thanks in advance.
Upvotes: 0
Views: 326
Reputation: 863791
Use defaultdict with the DataFrame constructor:
from collections import defaultdict

out1 = defaultdict(dict)
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out1[i][q] = (float(len(c)) / (len(a) + len(b) - len(c)))

print(out1)
df = pd.DataFrame(out1)
print(df)
                   The sky is blue  The ocean is blue
The ocean is blue              0.6                1.0
The sky is blue                1.0                0.6
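
To get the CSV layout asked for in the question, the matrix can be written with a labelled index column and its rows and columns put back in input order. This is a minimal sketch, assuming the questions are unique and non-empty. The `jaccard` helper, the hard-coded `questions` list, and the in-memory buffer are illustrative stand-ins for the data read from `data.csv` and for the file `output.csv`:

```python
from collections import defaultdict
import io

import pandas as pd


def jaccard(s1, s2):
    """Jaccard similarity of the whitespace-separated word sets of two strings."""
    a, b = set(s1.split()), set(s2.split())
    common = a & b
    return len(common) / (len(a) + len(b) - len(common))


# Stand-in for df['Question'] (assumed unique, non-empty strings).
questions = ['The sky is blue', 'The ocean is blue']

scores = defaultdict(dict)
for i in questions:
    for q in questions:
        scores[i][q] = jaccard(i, q)

# reindex fixes the row and column order to the input order.
matrix = pd.DataFrame(scores).reindex(index=questions, columns=questions)

# index_label names the first CSV column; pass 'output.csv' instead of
# the buffer to write a real file.
buffer = io.StringIO()
matrix.to_csv(buffer, index_label='Question')
csv_text = buffer.getvalue()
print(csv_text)
```

Here `index_label='Question'` puts the word "Question" in the top-left cell, which matches the header row requested in the question.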
Original solution with DataFrame.pivot:
out = []
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out.append({'Question1': q, 'Question2': i,
                    'Result': (float(len(c)) / (len(a) + len(b) - len(c)))})

# Keyword arguments are used because pivot no longer accepts positional ones in pandas 2.0+.
df = pd.DataFrame(out).pivot(index='Question1', columns='Question2', values='Result')
print(df)
Question2          The ocean is blue  The sky is blue
Question1
The ocean is blue                1.0              0.6
The sky is blue                  0.6              1.0
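
Both versions above compute every pair twice. Because the Jaccard similarity is symmetric, one possible variation (not part of the original answer) is to compute each unordered pair once and mirror the value. This is only a sketch, assuming unique, non-empty questions; the `jaccard` helper and the `questions` list are hypothetical names standing in for the loop body and for `df['Question']`:

```python
from itertools import combinations_with_replacement

import pandas as pd


def jaccard(s1, s2):
    """Jaccard similarity of the whitespace-separated word sets of two strings."""
    a, b = set(s1.split()), set(s2.split())
    common = a & b
    return len(common) / (len(a) + len(b) - len(common))


# Stand-in for df['Question'] (assumed unique, non-empty strings).
questions = ['The sky is blue', 'The ocean is blue']

# Square frame of NaN, labelled by the questions on both axes.
sym = pd.DataFrame(index=questions, columns=questions, dtype=float)

# Each unordered pair (including a question with itself) is visited once.
for i, q in combinations_with_replacement(questions, 2):
    score = jaccard(i, q)
    sym.loc[i, q] = score
    sym.loc[q, i] = score  # mirror into the other triangle

print(sym)
```

For n questions this makes n(n+1)/2 similarity computations instead of n², which may matter for larger files.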
Upvotes: 1