Reputation: 2410
I have a list of sets of the form
animals[0] = {'cat', 'cow', 'dog'}
animals[1] = {'cat', 'cow', 'pig'}
animals[2] = {'cat', 'dog', 'fish'}
animals[3] = {'tiger', 'fish', 'pig'}
animals[4] = {'dog', 'fish', 'pig'}
How can I group sets that share at least 2 items? To keep it simple, I will use a dict whose key is the first element of the overlapping set, and I do not care if some of the sets under that key share fewer than 2 items with each other. For example,
g['cat'] = [0,1,2]
g['fish'] = [2,3,4]
because sets 0 and 1 share the items cat and cow, and sets 0 and 2 share the items cat and dog. Sets 1 and 2 share only the item cat, but they are still included under the same key. A simple brute-force approach would be to iterate through the entire list and compare each set with every other set:
from collections import defaultdict

g = defaultdict(list)
for i, x in enumerate(animals):
    for j, y in enumerate(animals):
        intersection = x & y
        if len(intersection) >= 2:
            # sorted() makes "first element" deterministic, since sets are unordered
            g[sorted(intersection)[0]].append(j)
But this compares every pair of sets (O(n²)), which I think will be very time-consuming for a large list, so I would like to learn a better way to do this. Can anyone recommend a suitable function or approach?
Upvotes: 3
Views: 1440
Reputation: 2545
If you are dealing with a large dataset, you should consider using a sparse matrix. In particular, the technique I am going to describe scales well and can be used with MapReduce-like frameworks.
Before proceeding, therefore, I'll convert your example into a sparse matrix. The following approach is by no means the fastest one; the best choice usually depends on how your data are generated, for example whether you already have the complete set of items or you are dealing with an online algorithm working on a stream.
If the complete set is unknown, a common situation with online algorithms, you can build a family of hash functions that maps your keys to column indices, gradually adding hash functions until the false-positive rate (two keys mapping to the same bucket) is low enough.
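A minimal sketch of that idea, using a fixed hash (here hashlib.md5, a stand-in for one member of such a family, not a tunable one):

```python
import hashlib

def hash_bucket(key, n_buckets):
    """Deterministically map a string key to a column index in [0, n_buckets).

    Collisions (two keys sharing a bucket) are the false positives
    mentioned above; raising n_buckets lowers their rate.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

buckets = {a: hash_bucket(a, 64) for a in ['cat', 'cow', 'dog', 'pig', 'fish', 'tiger']}
```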
We will now proceed assuming that you have an ordered full set called all_the_animals, which lets you map each unordered set to a binary vector.
animals = [{'cat', 'cow', 'dog'},
{'cat', 'cow', 'pig'},
{'cat', 'dog', 'fish'},
{'tiger', 'fish', 'pig'},
{'dog', 'fish', 'pig'}]
all_the_animals = ['cat', 'cow', 'dog', 'pig', 'fish', 'tiger']
The actual conversion:
import scipy.sparse as sparse

def binarise(sets, full_set):
    """Return sparse binary matrix of given sets."""
    return sparse.csr_matrix([[x in s for x in full_set] for s in sets])
So to get the sparse matrix you would run:
sparse_matrix = binarise(animals, all_the_animals)
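As a sanity check, this sketch shows what the dense form looks like for the example data above (one row per set, one column per animal in all_the_animals order):

```python
import scipy.sparse as sparse

def binarise(sets, full_set):
    """Return sparse binary matrix of given sets."""
    return sparse.csr_matrix([[x in s for x in full_set] for s in sets])

animals = [{'cat', 'cow', 'dog'},
           {'cat', 'cow', 'pig'},
           {'cat', 'dog', 'fish'},
           {'tiger', 'fish', 'pig'},
           {'dog', 'fish', 'pig'}]
all_the_animals = ['cat', 'cow', 'dog', 'pig', 'fish', 'tiger']

dense = binarise(animals, all_the_animals).toarray().astype(int)
print(dense)
# [[1 1 1 0 0 0]
#  [1 1 0 1 0 0]
#  [1 0 1 0 1 0]
#  [0 0 0 1 1 1]
#  [0 0 1 1 1 0]]
```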
Once you have the sparse matrix, you can compute pairwise similarities with cosine_similarity from sklearn:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(sparse_matrix)
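A note on why a 2/3 threshold (used below) matches the "at least 2 of 3 items" requirement: for binary rows that each encode 3 items, cosine similarity reduces to shared_items / (sqrt(3) * sqrt(3)) = shared_items / 3. A quick standalone check:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1, 1, 1, 0, 0, 0]])  # {'cat', 'cow', 'dog'}
b = np.array([[1, 1, 0, 1, 0, 0]])  # {'cat', 'cow', 'pig'}, shares 2 items with a

# 2 shared items / (sqrt(3) * sqrt(3)) = 2/3
sim = cosine_similarity(a, b)[0, 0]
print(sim)
```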
The obtained similarity matrix can be visualised with seaborn.heatmap:
import matplotlib.pyplot as plt
import seaborn

seaborn.heatmap(similarities, annot=True, cmap="YlGnBu")
plt.show()
Now you can choose a similarity threshold. In your question, you ask for 2 of the 3 elements to be in common, which corresponds to a cosine similarity of 2/3:
threshold = 2/3
Using numpy.where, you can find all pairs whose similarity is at or above the threshold:
import numpy as np
similar = np.where(similarities >= threshold)
Obtaining:
(array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]),
array([0, 1, 2, 0, 1, 0, 2, 4, 3, 4, 2, 3, 4]))
Now that we have the pairs of similar sets in similar, we can group the column indices by row index:
similar_sets = [(i, similar[1][similar[0] == i]) for i in np.unique(similar[0])]
The result looks like:
[(0, array([0, 1, 2])),
(1, array([0, 1])),
(2, array([0, 2, 4])),
(3, array([3, 4])),
(4, array([2, 3, 4]))]
To visualise the results, you can use the ordered full set to recover the animal names:
similar_animal_sets = [(all_the_animals[i], similar_set) for i, similar_set in similar_sets]
which outputs:
[('cat', array([0, 1, 2])),
('cow', array([0, 1])),
('dog', array([0, 2, 4])),
('pig', array([3, 4])),
('fish', array([2, 3, 4]))]
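Putting the pieces together (plotting aside), the whole pipeline runs end to end, and the groups can be collected into a plain dict shaped like the one in the question:

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.metrics.pairwise import cosine_similarity

animals = [{'cat', 'cow', 'dog'},
           {'cat', 'cow', 'pig'},
           {'cat', 'dog', 'fish'},
           {'tiger', 'fish', 'pig'},
           {'dog', 'fish', 'pig'}]
all_the_animals = ['cat', 'cow', 'dog', 'pig', 'fish', 'tiger']

def binarise(sets, full_set):
    """Return sparse binary matrix of given sets."""
    return sparse.csr_matrix([[x in s for x in full_set] for s in sets])

similarities = cosine_similarity(binarise(animals, all_the_animals))
similar = np.where(similarities >= 2 / 3)
similar_sets = [(i, similar[1][similar[0] == i]) for i in np.unique(similar[0])]
g = {all_the_animals[i]: group.tolist() for i, group in similar_sets}
print(g)
# {'cat': [0, 1, 2], 'cow': [0, 1], 'dog': [0, 2, 4], 'pig': [3, 4], 'fish': [2, 3, 4]}
```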
Upvotes: 6