Efficiently grouping sets based on number of similar items

Question

I have a list of sets of the form

animals[0] = {'cat', 'cow', 'dog'}
animals[1] = {'cat', 'cow', 'pig'}
animals[2] = {'cat', 'dog', 'fish'}
animals[3] = {'tiger', 'fish', 'pig'}
animals[4] = {'dog', 'fish', 'pig'}

How can I group sets that have at least 2 similar items ? To make it simple, I will use a dict whose key is the first element of the overlapping set and do not care if some of the items belonging to that key have lesser than 2 similar items. For example,

g['cat'] = [0,1,2]
g['fish'] = [2,3,4]

because list 0 and 1 have the overlapping items cat and cow and list 0 and 2 have the overlapping items cat and dog. list 1 and 2 however, only have the overlapping item cat but are still included in the dict. A simple brute force approach would be to iterate through the entire list and check each list with every other list.

for i,x in enumerate(animals):
    for j,y in enumerate(animals):

        intersection = x & y
        if(len(intersection)>=2):
            dict[list(intersection)[0]].append(j)

But this is very time consuming if I have a very large list I think and so I would like to learn a better way to do this. If anyone has any function to recommend.

Luca Cappelletti · Accepted Answer

Solution using cosine similarity

About sparse matrices

If you are talking of a large dataset, you should consider using a sparse matrix. In particular, the technique I am going to describe scales well and can be used with MapReduce-like frameworks.

Before proceeding, therefore, I'll convert the example you have given into a sparse matrix. By no means the following approach is the fastest one, usually it depends on how you data are generated, for example if you already have a complete set or you are dealing with an online algorithm working on a stream.

Notes on unknown full set

If you are dealing with an unknown complete set, common issue when dealing with online algorithms, you can try and create an hash function family built to map your keys to N and gradually and hash functions until you have an low enough false positive rate (keys mapping to the bucket of another key).

Converting the sets to a sparse matrix

We will now proceed assuming that you have an ordered full set called all_the_animals, that allows you to map the unordered set to N.

animals = [{'cat', 'cow', 'dog'},
{'cat', 'cow', 'pig'},
{'cat', 'dog', 'fish'},
{'tiger', 'fish', 'pig'},
{'dog', 'fish', 'pig'}]

all_the_animals = ['cat', 'cow', 'dog', 'pig', 'fish', 'tiger']

The actual conversion:

import scipy.sparse as sparse

def binarise(sets, full_set):
    """Return sparse binary matrix of given sets."""
    return sparse.csr_matrix([[x in s for x in full_set] for s in sets])

So to get the sparse matrix you would run:

sparse_matrix = binarise(animals, all_the_animals)

Using the cosine similarity

Once you have the sparse matrix, you can proceed using the cosine similarity from sklearn, it being defined as:

from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(sparse_matrix)

Visualising cosine similarity

The obtain similarity matrix looks like, using seaborn.heatmap:

import matplotlib.pyplot as plt
import seaborn
seaborn.heatmap(s, annot=True, cmap="YlGnBu")
plt.show()

Thresholding

Now you can choose a threshold of similarity. For example, in your question, you are asking for the 2/3 of the elements to be in common:

threshold = 2/3

Using numpy.where you can execute:

import numpy as np
similar = np.where(similarities >= threshold)

Obtaining:

(array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]),
 array([0, 1, 2, 0, 1, 0, 2, 4, 3, 4, 2, 3, 4]))

Extracting the similar groups

Now that we have the set of similar_animals we can run:

similar_sets = [(i, similar[1][similar[0]==i]) for i in np.unique(similar[0]))]

The result looks like:

[(0, array([0, 1, 2])),
 (1, array([0, 1])),
 (2, array([0, 2, 4])),
 (3, array([3, 4])),
 (4, array([2, 3, 4]))]

To visualise the results you can use the ordered set to get the animals name:

similar_animal_sets = [(all_the_animals[i], similar_set) for i, similar_set in similar_sets]

which outputs:

[('cat', array([0, 1, 2])),
 ('cow', array([0, 1])),
 ('dog', array([0, 2, 4])),
 ('pig', array([3, 4])),
 ('fish', array([2, 3, 4]))]