Reputation: 527
Given an array of sets find the one that does not belong:
example: [[a,b,c,d], [a,b,f,g], [a,b,h,i], [j,k,l,m]]
output: [j,k,l,m]We can see above that the first three sets have a common subset [a,b] and the last one does not. Note: There may be a case where the outlier set does have elements contained in the input group. In this case we have to find the set that has the least in common with the other sets.
I have tried iterating over the input list and keeping a count for each character (in a hash).
In a second pass, find which set has the smallest total.
In the example above, the last set would have a sum of counts of 4:
j*1 + k*1 + l*1 + m*1.
I'd like to know if there are better ways to do this.
Upvotes: 2
Views: 501
Reputation: 101
You shouldn't be looking for the smallest sum of count of elements. It is dependent on the size of the set. But if you substract the size of the set from the sum, it's 0 only if the set is disjoint from all the others. Another option, is to look at the maximum of the count of its elements. If the maximum is one on a set, then they only belong to the set.
There are many functions you can use. As the note states:
Note: There may be a case where the outlier set does have elements contained in the input group. In this case we have to find the set that has the least in common with the other sets.
The previous functions are not optimal. A better function would count the number of shared elements. Set the value of an element to 1 if it's in multiple sets and 0 if it appears only once.
Upvotes: 4
Reputation: 77837
Your description:
find the set that has the least in common with the other sets
Doing this as a general application would require computing similarity with each individual pair of sets; this does not seem to be what you describe algorithmically. Also, it's an annoying O(n^2) algorithm.
I suggest the follow amendment and clarification
find the set that least conforms to the mean of the entire list of sets.
This matches your description much better, and can be done in two simple passes, O(n*m) where you have n
sets of size m
.
The approach you outlined does the job quite nicely: count the occurrences of each element in all of the sets, O(nm). Then score each set according to the elements it contains, also O(nm). Keep track of which element has the lowest score.
For additional "accuracy", you could sort the scores and look for gaps in the scoring -- this would point out multiple outliers.
If you do this in Python, use the Counter
class for your tally.
Upvotes: 5