emax
emax

Reputation: 7235

Python: jaccard similartity between two networks?

I have 2 large networks G and G1 generated with the networkx package. I want to compute the jaccard similarity index between all the nodes.

One possible way is the following:

def returnJaccardNetworks(G, G1):
    tmp =   list(G.nodes())
    tmp1 =  list(G1.nodes())
    tmp2 =  np.unique([tmp, tmp1]) ### Find nodes in the networks
    jc = []
    for i in tmp2:
    ## if the node i is in G and in G1 compute 
    ## the similarity between the lists of the ajacent nodes
    ## otherwise append 0
        if (i in G) and (i in G1):  
            k1 = list(G[i]) ## adjacent nodes of i in the network G     
            k2 = list(G1[i]) ## adjacent nodes of i in the network G1 
            ### Start Jaccard Similarity
            intersect = list(set(k1) & set(k2))
            n = len(intersect)
            jc.append(n / float(len(k1) + len(k2) - n))
            ### End Jaccard Similariy
        else:
            jc.append(0)
    return jc

I am wondering if there is a more efficient way. I have noticed that there is a function in the package called jaccard_coefficient but I am not sure how it works.

https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.algorithms.link_prediction.jaccard_coefficient.html

Upvotes: 3

Views: 2462

Answers (1)

Paul Brodersen
Paul Brodersen

Reputation: 13031

Your implementation is pretty damn efficient (albeit not pretty, IMO). I can shave off 15% execution time on my machine with this version:

def get_jaccard_coefficients(G, H):
    for v in G:
        if v in H:
            n = set(G[v]) # neighbors of v in G
            m = set(H[v]) # neighbors of v in H
            length_intersection = len(n & m)
            length_union = len(n) + len(m) - length_intersection
            yield v, float(length_intersection) / length_union
        else:
            yield v, 0. # should really yield v, None as measure is not defined for these nodes

This other version is much more compact and easier to maintain at the cost of a 30% increase in execution time:

def get_jaccard_coefficients(G, H):
    for v in set(G.nodes) & set(H.nodes): # i.e. the intersection
        n = set(G[v]) # neighbors of v in G
        m = set(H[v]) # neighbors of v in H
        yield v, len(n & m) / float(len(n | m))

Upvotes: 4

Related Questions