Reputation: 83
I am working on a small data mining project and have run into a problem: I need to scan the 'item matrix' and count the occurrences of each candidate itemset.
This is what the candidate itemsets look like: a list of several frozensets.
[{'🌭', '🍔', '🍕'},
{'🍆', '🍉', '🍑'},
{'🍆', '🍊', '🍑'},
{'🌭', '🍔', '🍦'},
{'🌭', '🌮', '🍕'}]
And below is the item matrix that I obtained. For every candidate in my candidate itemsets, I need to check whether it is a subset of each row of the item matrix. In other words, I have to count the occurrences of each candidate itemset per row and sum them up.
I have tried nested for loops: for each row of the matrix, I check whether each candidate is a subset of that row and, if it is, add 1 to its count. However, I could not store the counts in a dictionary because a set is unhashable, and now I am rather frustrated with this problem.
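For reference, the loop described above can be sketched in plain Python. This is only a minimal illustration under assumptions: the `rows` data is made up for the example (it is not the item matrix from the question), and each candidate is wrapped in `frozenset`, which is hashable and can therefore serve as a dictionary key, unlike a plain `set`.

```python
# Hypothetical transactions, one set of items per row (illustrative data only).
rows = [
    {"Milk", "Eggs", "Yogurt"},
    {"Milk", "Apple", "Eggs"},
    {"Onion", "Yogurt", "Eggs"},
]

# frozenset is immutable and hashable, so it is a valid dictionary key.
candidates = [
    frozenset({"Apple", "Milk"}),
    frozenset({"Eggs", "Milk"}),
    frozenset({"Onion", "Yogurt"}),
]

# Start every candidate at zero, then add 1 for each row that contains it.
counts = dict.fromkeys(candidates, 0)
for row in rows:
    for cand in candidates:
        if cand <= row:  # True when cand is a subset of row
            counts[cand] += 1

for cand, n in counts.items():
    print(sorted(cand), n)
```

The subset operator `<=` does the per-row check, so no manual comparison of individual items is needed.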
To make the example reproducible, I changed the emoji to strings.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
candidate_set = [{'Apple', 'Milk'}, {'Eggs', 'Milk'}, {'Onion', 'Yogurt'}]
The goal is to find, for each candidate, the total number of rows in which all of its items (for example 'Apple' and 'Milk') are true.
Any help would be appreciated! Thanks
Upvotes: 1
Views: 80
Reputation: 13458
Here is a short, reproducible example of one way to do it:
import pandas as pd
df = pd.DataFrame(
    {
        "Apple": [True, False, False, False, True],
        "Milk": [True, True, False, False, True],
        "Eggs": [False, False, True, True, True],
        "Onion": [True, False, True, False, True],
        "Yogurt": [False, False, True, False, False],
    }
)
candidate_set = [{"Apple", "Milk"}, {"Eggs", "Milk"}, {"Onion", "Yogurt"}]
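counts = {
    # Sort each pair so the key order is deterministic and matches the column lookup.
    # The value is the number of rows where both columns are True.
    tuple(sorted(pair)): df.loc[df[sorted(pair)[0]] & df[sorted(pair)[1]], :].shape[0]
    for pair in candidate_set
}
print(f"{counts=}")
# Output
counts={('Apple', 'Milk'): 2, ('Eggs', 'Milk'): 1, ('Onion', 'Yogurt'): 1}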
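The dictionary comprehension above handles two-item candidates only. For itemsets of any size, such as the three-item frozensets in the question, one possible generalization is to select all of a candidate's columns and keep the rows where every one of them is true. The sketch below assumes a boolean DataFrame whose column names match the item names; `count_support` is a hypothetical helper name, and the three-item candidate is added purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Apple": [True, False, False, False, True],
        "Milk": [True, True, False, False, True],
        "Eggs": [False, False, True, True, True],
        "Onion": [True, False, True, False, True],
        "Yogurt": [False, False, True, False, False],
    }
)

# The last candidate has three items and is included only for illustration.
candidate_set = [
    {"Apple", "Milk"},
    {"Eggs", "Milk"},
    {"Onion", "Yogurt"},
    {"Apple", "Milk", "Onion"},
]


def count_support(frame, candidates):
    """Count, for each candidate, the rows in which all of its items are True.

    Hypothetical helper: keys are frozensets, which are hashable.
    """
    return {
        frozenset(cand): int(frame[sorted(cand)].all(axis=1).sum())
        for cand in candidates
    }


counts = count_support(df, candidate_set)
for cand, n in counts.items():
    print(sorted(cand), n)
```

Using `frozenset` keys also sidesteps the "set is unhashable" error from the question, and a lookup such as `counts[frozenset({"Milk", "Apple"})]` works regardless of item order.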
Upvotes: 0