kdas

Reputation: 620

How to do data correlation clustering plot in python

I've got a database that contains information about commits made to a repo. For example:

commit-sha1 | file1 | 
commit-sha1 | file2 |
commit-sha2 | file2 |
commit-sha2 | file3 | 

and so on. This shows that sha1 changed file1 and file2, and sha2 changed file2 and file3. Now I want to see whether some files are correlated, i.e. what the chances are that file1 and file2 are committed together. To do this, I first found the 50 most frequently committed files, which gave me:

file1 - 1500
file2 - 1423
file3 - 1222..

I've set d_value to -1 when Q(f1, f2) <= P(f1) * P(f2). For example, since no commit in the db contains both file1 and file3 (i.e. Q(file1, file3) = 0), their d_value is -1. Now, assuming I have the list of d_values for all pairs of files, how can I perform hierarchical clustering to see which files are correlated? I believe SciPy's linkage() function will help, but I'm not sure how to use it with this data. Any help is appreciated. Thanks.

Upvotes: 0

Views: 281

Answers (1)

keineahnung2345

Reputation: 2701

A simple example:

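import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Condensed distance vector for 4 files, in the pair order
# (f1,f2), (f1,f3), (f1,f4), (f2,f3), (f2,f4), (f3,f4)
d_value = np.array([3.2, 100, 0.12, 7.6, 100, 5.2])
Z = linkage(d_value, 'ward')  # build the hierarchical clustering
fig = plt.figure()
dn = dendrogram(Z)            # draw the cluster tree
plt.show()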

The result:

[image: dendrogram of the four files]

Note that I've changed your -1 to 100, since the distance between file1 and file3 should be large when they have never been committed together.
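If your d_values are stored per pair of files rather than as a flat array, something like the sketch below could build the input for linkage(). It rests on several assumptions that are not in the question: the file names, the dictionary layout, the helper build_distance_matrix and the stand-in distance LARGE are all made up for illustration, and d_value is treated as a distance (smaller means more related), as in the example above. It uses 'average' linkage instead of 'ward', because SciPy documents 'ward' as correctly defined only for Euclidean distances, which these values may not be.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical file names and pairwise d_values (illustration only).
files = ["file1", "file2", "file3", "file4"]
d_values = {
    ("file1", "file2"): 3.2,
    ("file1", "file3"): -1,   # -1 marks "not related", as in the question
    ("file1", "file4"): 0.12,
    ("file2", "file3"): 7.6,
    ("file2", "file4"): -1,
    ("file3", "file4"): 5.2,
}

LARGE = 100.0  # assumed stand-in distance for pairs marked -1


def build_distance_matrix(files, d_values, large=LARGE):
    """Return a symmetric matrix of distances with a zero diagonal."""
    index = {name: i for i, name in enumerate(files)}
    dist = np.zeros((len(files), len(files)))
    for (a, b), d in d_values.items():
        value = large if d < 0 else d
        dist[index[a], index[b]] = value
        dist[index[b], index[a]] = value
    return dist


dist = build_distance_matrix(files, d_values)
condensed = squareform(dist)              # condensed vector that linkage() accepts
Z = linkage(condensed, method="average")  # average linkage on precomputed distances
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into at most 2 clusters
clusters = dict(zip(files, labels))
print(clusters)
```

To get file names on the plot, you can pass labels=files to dendrogram(Z, ...). The cluster count t=2 is arbitrary here; you would pick it (or a distance threshold) after looking at the dendrogram.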

Upvotes: 1
