Cooper

Reputation: 83

How to count occurrences of each candidate itemset using the item matrix

I am doing a small data mining project and have run into a problem: I need to scan the 'item matrix' and count the occurrences of each candidate itemset.

This is what the candidate itemsets look like. It is a list of several frozensets.
[{'🌭', '🍔', '🍕'},
 {'🍆', '🍉', '🍑'},
 {'🍆', '🍊', '🍑'},
 {'🌭', '🍔', '🍦'},
 {'🌭', '🌮', '🍕'}]

Below is the item matrix that I obtained. For every candidate in my list of candidate itemsets, I need to check whether it is a subset of each row of the item matrix. In other words, I have to count the number of rows that contain each candidate itemset.

[Image: item matrix]

I have tried nested for loops: for each row of the matrix, I check whether each candidate is a subset of that row, and if it is, I add 1 to its count. However, I cannot store the counts in a dictionary keyed by itemset, because a set is unhashable, and I am stuck at this point.

To make the example reproducible, I changed the emoji to strings.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

candidate_set = [{'Apple', 'Milk'}, {'Eggs', 'Milk'}, {'Onion', 'Yogurt'}]

The goal is to find, for each candidate (for example 'Apple' and 'Milk'), the total number of rows in which all of its items are True.

Any help would be appreciated! Thanks

Upvotes: 1

Views: 80

Answers (1)

Laurent

Reputation: 13458

Here is a short, reproducible example of one way to do it:

import pandas as pd

df = pd.DataFrame(
    {
        "Apple": [True, False, False, False, True],
        "Milk": [True, True, False, False, True],
        "Eggs": [False, False, True, True, True],
        "Onion": [True, False, True, False, True],
        "Yogurt": [False, False, True, False, False],
    }
)

candidate_set = [{"Apple", "Milk"}, {"Eggs", "Milk"}, {"Onion", "Yogurt"}]
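counts = {
    # Sort each pair so that both the key and the column lookup are deterministic
    # (iteration order of a set of strings can change between runs).
    tuple(sorted(pair)): df.loc[df[sorted(pair)[0]] & df[sorted(pair)[1]], :].shape[0]
    for pair in candidate_set
}

print(f"{counts=}")
# Output
# counts={('Apple', 'Milk'): 2, ('Eggs', 'Milk'): 1, ('Onion', 'Yogurt'): 1}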
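
The dictionary comprehension above only handles pairs, while the candidates in the question contain three items each. Here is a minimal sketch of one way to generalize it, assuming the item matrix is a boolean DataFrame with one column per item. The helper name count_support is made up for this example, and the three-item candidate is added only to illustrate the idea. Using frozenset for the keys also avoids the "unhashable set" problem mentioned in the question.

```python
import pandas as pd

# Same toy item matrix as above: one boolean column per item, one row per transaction.
df = pd.DataFrame(
    {
        "Apple": [True, False, False, False, True],
        "Milk": [True, True, False, False, True],
        "Eggs": [False, False, True, True, True],
        "Onion": [True, False, True, False, True],
        "Yogurt": [False, False, True, False, False],
    }
)

# Candidates may have any size; the last one is an illustrative three-item set.
candidate_set = [{"Apple", "Milk"}, {"Eggs", "Milk"}, {"Apple", "Milk", "Onion"}]


def count_support(frame, candidates):
    """Hypothetical helper: count the rows in which every item of a candidate is True."""
    return {
        # frozenset is hashable, so it can be used as a dictionary key.
        # Select the candidate's columns, keep rows where all are True, and count them.
        frozenset(candidate): int(frame[sorted(candidate)].all(axis=1).sum())
        for candidate in candidates
    }


support = count_support(df, candidate_set)
for itemset, n_rows in support.items():
    print(sorted(itemset), n_rows)
```

Each count can then be looked up with, for example, support[frozenset({"Apple", "Milk"})]. If the DataFrame comes from TransactionEncoder as in the question, the same call should work unchanged, since te.columns_ supplies the item names as column labels.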

Upvotes: 0
