Reputation: 3885
I have a lot of lists, and I want to find group of most common elements that appear in the lists.
For example:
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
How to find group of elements that appear in these lists, for example:
Group 1: ['apple', 'banana', 'orange'] in l1, l2, appear 2 times
Group 2: ['apple', 'grape', 'kiwi'] in l4, l5, appear 2 times
Group 3: ['apple', 'grape', 'orange'] in l2, l5, appear 2 times
Index of the element is not important. Group should have minimum 3 and maximum 5 elements. Lists can have from 3 to 10 elements.
I know that I can do something like this with intersections, but what if I have totally different list:
l7 = ["x", "y", "z", "k"]
Elements from this list are not appearing in any other list
Upvotes: 1
Views: 114
Reputation: 96
You can use A.issubset(B)
in order to check if elements of A are present in B or not, it will simply return False.
def count1(grp=None, list_num=None):
return grp.issubset(list_num)
if __name__ == "__main__":
g1 = {'apple', 'banana', 'orange'}
g2 = {'apple', 'grape', 'kiwi'}
g3 = {'apple', 'grape', 'orange'}
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
group_count = {}
groups = [g1, g2, g3]
list_check = [l1, l2, l3, l4, l5, l6]
for index_list, list_num in enumerate(list_check):
for index, grp in enumerate(groups):
if count1(grp, list_num):
group_count.setdefault(f'g{index + 1}',[]).append(f'l{index_list + 1}')
print(group_count)
Output : This what i understood from question.
{'g1': ['l1', 'l2'], 'g3': ['l2', 'l5'], 'g2': ['l4', 'l5']}
Upvotes: 0
Reputation: 1012
My solutions:
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
# return same elements of two lists
def sub_eq_lists(l1, l2):
return list(set([x for x in l1 + l3 if x in l1 and x in l2]))
all_lists = [l1, l2, l3, l4, l5, l6]
results = []
for j in all_lists:
for l in all_lists:
if(l!=j):
same_elements = sub_eq_lists(j,l)
if (len(same_elements) > 2) and (len(same_elements) < 11):# your conditions
results.append(same_elements)
# count occurrences
results = {str(item): results.count(item) for item in results}
results
Output:
{"['apple', 'orange', 'banana']": 2,
"['apple', 'orange', 'grape']": 2,
"['apple', 'kiwi', 'grape']": 2}
Upvotes: 0
Reputation: 1826
You could do something like this using a powerset of the combinations, obviously excluding the ones that are not in the range 3-5
lists.
Note: For the
powerset
function i give credit to this other answer from martineu:
Using itertools.chain.from_iterable
and itertools.combinations
to get the combinations.
itertools.chain.from_iterable(iterable)
Alternate constructor for chain(). Gets chained inputs from a single iterable argument that is evaluated lazily
and using functools.reduce
to do some workaround for the intersections, I have come up with this code that think that might help:
functools.reduce(function, iterable[, initializer])
from the docs:
Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
from itertools import chain, combinations
from functools import reduce
from pprint import pprint
lsts = [['apple', 'banana', 'orange'],
['apple', 'banana', 'grape', 'lemon', 'orange'],
['banana', 'grape', 'kiwi'],
['apple', 'grape', 'kiwi', 'peach'],
['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
['chery', 'kiwi', 'pear']]
lsts = tuple(map(set, lsts))
def powerset(iterable, start=0, stop=None): #from answer https://stackoverflow.com/a/40986475/13285707
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable) # allows duplicate elements
if stop is None:
stop = len(s)+1
return chain.from_iterable(combinations(s, r) for r in range(start, stop))
groups = []
for x in powerset(lsts, 3, 5):
if 3 < len(c:=reduce(set.intersection,x)) < 5:
groups.append(c)
pprint(groups)
Output:
[{'apple', 'orange'}, {'apple', 'grape'}, {'grape', 'kiwi'}]
Upvotes: 0
Reputation: 6026
The problem you are facing is called Frequent Itemsets Mining. The 2 most popular algorithms (at least that I know) that solve this problem are:
They are very well explained in the book Data Mining. Concepts and Techniques (3rd Edition). A library that implements apriori is apyori
, and this is how you could use it:
from apyori import apriori
transactions = [['apple', 'banana', 'orange'],
['apple', 'banana', 'grape', 'lemon', 'orange'],
['banana', 'grape', 'kiwi'],
['apple', 'grape', 'kiwi', 'peach'],
['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear'],
['chery', 'kiwi', 'pear']]
results = list(map(lambda x: list(x.items),
filter(lambda x: len(x.items) >= 3,
apriori(transactions, min_support=2/len(transactions), max_length=5)
)
))
print(results)
It returns:
[['banana', 'apple', 'orange'], ['grape', 'kiwi', 'apple'], ['grape', 'apple', 'orange']]
About the arguments of the apriori
call:
min_support
is the minimum frequency a certain itemset must have. In your case, it is 2/len(transactions)
because the itemset must be present in at least 2 of the transactionsmax_length
is the maximum length of an itemsethelp(apyori.apriori)
does not mention anything like that).Upvotes: 0
Reputation: 919
I think this might help, (I gave it a degree of freedom on the number of combinations to to extract and the minimum number of elements in each given combination :
import itertools
def get_max_rep(all_lists, min_comb_len, num_el_displayed):
"""Returns 'num_el_displayed' number of elements that repeat the most in the given lists"""
# extract all the unique values of all the lists
all_elements = set([el for l in all_lists for el in l])
# build all the possible combinations starting from 'min_comb_len' number of elements
combinations = [
el for r in range(min(min_comb_len, len(all_elements)-1),len(all_elements))
for el in itertools.combinations(all_elements, r)
]
# count the number of repetitions of each combination in the given lists
out = sorted(
[(comb, sum([all(fruit in el for fruit in comb) for el in all_lists]))
for comb in combinations], key=lambda x:x[1], reverse=True
)[:num_el_displayed]
return out
To test it out (here I want the first 5
combinations that have the most repetitions and that have a minimum of 2
elements:
# testing ...
l1 = ['apple', 'banana', 'orange']
l2 = ['apple', 'banana', 'grape', 'lemon', 'orange']
l3 = ['banana', 'grape', 'kiwi']
l4 = ['apple', 'grape', 'kiwi', 'peach']
l5 = ['apple', 'blueberry', 'grape', 'kiwi', 'orange', 'pear']
l6 = ['chery', 'kiwi', 'pear']
all_lists = [l1, l2, l3, l4, l5, l6]
print(get_max_rep(all_lists, 2, 5))
output:
[(('grape', 'kiwi'), 3), (('grape', 'apple'), 3), (('orange', 'apple'), 3), (('banana', 'grape'), 2), (('banana', 'orange'), 2)]
Upvotes: 1