Isaac Rivera

Reputation: 103

Save Jaccard Similarity in a CSV file

I have built the following code to analyze the Jaccard Similarity:

import pandas as pd

# Read the first column of the CSV into a column named 'Question'
df = pd.read_csv('data.csv', usecols=[0], names=['Question'],
                 encoding='utf-8')

out = []
for i in df['Question']:
    a = set(i.split())
    for q in df['Question']:
        b = set(q.split())
        c = a.intersection(b)
        # Jaccard similarity: shared words divided by the size of the union
        out.append({'Question': q,
                    'Result': float(len(c)) / (len(a) + len(b) - len(c))})

new_df = pd.DataFrame(out, columns=['Question', 'Result'])
new_df.to_csv('output.csv', index=False, encoding='utf-8')

The output file looks like this:

Question          Result
The sky is blue    1.0
The ocean is blue  0.6
The sky is blue    0.6
The ocean is blue  1.0

The results are correct, but I would like to change the CSV output so that it shows them as a matrix, like this:

Question          The sky is blue The ocean is blue
The sky is blue    1.0             0.6
The ocean is blue  0.6             1.0

I tried changing the code to use writerows, but I guess I'm missing something. Thanks in advance.

Upvotes: 0

Views: 326

Answers (1)

jezrael

Reputation: 863791

Use a defaultdict with the DataFrame constructor:

from collections import defaultdict

out1 = defaultdict(dict)
for i in df['Question']:
    a = set(i.split())
    for q in df['Question']:
        b = set(q.split())
        c = a.intersection(b)
        # Outer key becomes the column, inner key becomes the row label
        out1[i][q] = float(len(c)) / (len(a) + len(b) - len(c))
print (out1)

df = pd.DataFrame(out1)
print (df)
                   The sky is blue  The ocean is blue
The ocean is blue              0.6                1.0
The sky is blue                1.0                0.6
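
To get this matrix into a CSV with Question as the header of the first column, one option is to pass index_label to to_csv. The following is a minimal sketch, not a definitive implementation: the two-row DataFrame is a hypothetical stand-in for data.csv, and the names matrix and csv_text are chosen only for illustration.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv
df = pd.DataFrame({'Question': ['The sky is blue', 'The ocean is blue']})

out1 = defaultdict(dict)
for i in df['Question']:
    a = set(i.split())
    for q in df['Question']:
        b = set(q.split())
        c = a.intersection(b)
        out1[i][q] = len(c) / (len(a) + len(b) - len(c))

matrix = pd.DataFrame(out1)

# With no path given, to_csv returns the CSV text as a string.
# index_label sets the header of the first (row label) column.
csv_text = matrix.to_csv(index_label='Question')
print(csv_text)

# To write a file instead, pass a path as the first argument, e.g.
# matrix.to_csv('output.csv', index_label='Question', encoding='utf-8')
```

Because the similarity is symmetric, it does not matter here which question ends up as the row and which as the column.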

Original solution with DataFrame.pivot:

out = []
for i in df['Question']:
    a = set(i.split())
    for q in df['Question']:
        b = set(q.split())
        c = a.intersection(b)
        # Keep both questions so the long table can be pivoted afterwards
        out.append({'Question1': q, 'Question2': i,
                    'Result': float(len(c)) / (len(a) + len(b) - len(c))})

# pivot requires keyword arguments in pandas 2.0 and later
df = pd.DataFrame(out).pivot(index='Question1', columns='Question2', values='Result')
print (df)
Question2          The ocean is blue  The sky is blue
Question1                                            
The ocean is blue                1.0              0.6
The sky is blue                  0.6              1.0
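
Both loops above divide by the size of the union, which is zero when both strings are empty. A possible way to guard against that is to move the calculation into a small helper. This is only a sketch: the jaccard function and the questions list are hypothetical names, and treating two empty strings as fully similar (1.0) is an assumption you may want to change.

```python
import pandas as pd


def jaccard(s1, s2):
    """Jaccard similarity of the whitespace-separated word sets of two strings."""
    a, b = set(s1.split()), set(s2.split())
    union = a | b
    # Assumption: two empty strings count as identical, avoiding division by zero
    return len(a & b) / len(union) if union else 1.0


# Hypothetical sample data standing in for df['Question']
questions = ['The sky is blue', 'The ocean is blue']

# Build the square matrix directly, with the questions as both rows and columns
matrix = pd.DataFrame(
    [[jaccard(r, c) for c in questions] for r in questions],
    index=questions,
    columns=questions,
)
print(matrix)
```

Building the matrix from a nested list keeps the rows and columns in the same order as the input.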

Upvotes: 1
