Reputation: 774
I have the following data frame:
data = pd.read_csv('sample.csv', sep=',')
I need to find the frequency of each itemset in a set of itemsets. For example:
itemsets = {(143, 157), (143, 166), (175, 178), (175, 190)}
The goal is to find the frequency of each tuple in the data frame (I am trying to implement the Apriori algorithm). I am particularly having trouble with how to address the tuples individually and how to search for a whole tuple rather than for individual entries in the data.
Update-1
For example, the data frame looks like this:
39, 120, 124, 205, 401, 581, 704, 814, 825, 834
35, 39, 205, 712, 733, 759, 854, 950
39, 422, 449, 704, 825, 857, 895, 937, 954, 964
Update-2
The function should increment the count for a tuple only if all the values in that tuple are present in a particular row.
For example, if I search for (39, 205), it should return a frequency of 2 because 2 of the rows include both 39 and 205 (the first and second rows).
Upvotes: 0
Views: 910
Reputation: 1570
First of all, since there's some misunderstanding about what the question is, this answer answers the question "How to count the number of rows in which every item in the item set appears at least once?".
For each row in the data frame, we can decide whether it counts toward the frequency using
all(item in row for item in items)
where items is an item set, for example (39, 205).
We can iterate over all the rows using DataFrame.itertuples (passing index=False so that the row's index label is not mistaken for an item), so for every item set items, its frequency is
sum(1 for row in map(set, df.itertuples(index=False, name=None)) if all(item in row for item in items))
(We use map(set, ...) to turn the tuples into sets. This is not required, but it makes the membership tests faster.)
Finally, we iterate over all the item sets in itemsets and store the result in a dictionary where the keys are the item sets and the values are the frequencies:
{items: sum(1 for row in map(set, df.itertuples(index=False, name=None)) if all(item in row for item in items)) for items in itemsets}
{(39, 205): 2}
If you don't like the one-line version, you can expand the algorithm into several lines like so:
d = {}  # output dictionary
for items in itemsets:
    frequency = 0
    # index=False keeps the index label out of the row values
    for row in df.itertuples(index=False, name=None):
        row = set(row)  # done for efficiency
        for item in items:
            if item not in row:
                break
        else:  # no break: every item was found in this row
            frequency += 1
    d[items] = frequency
Additional information about for ... else can be found in this answer.
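The same counting rule can also be written with set operations. The following is only a sketch under a few assumptions: the function name itemset_frequencies is made up for illustration, the data frame is rebuilt from the three sample rows in the question, and NaN padding from the shorter row is dropped before comparing. It builds each row set once instead of once per item set, then checks whether the item set is a subset of the row.

```python
import pandas as pd

def itemset_frequencies(df, itemsets):
    # Hypothetical helper: build one set per row, once,
    # dropping the NaN padding that ragged rows produce.
    row_sets = [
        {value for value in row if pd.notna(value)}
        for row in df.itertuples(index=False, name=None)
    ]
    # A row counts for an item set when the item set is a subset of the row.
    return {
        items: sum(1 for row in row_sets if set(items) <= row)
        for items in itemsets
    }

# Sample rows taken from the question.
df = pd.DataFrame([
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
])
result = itemset_frequencies(df, {(39, 205), (39, 704), (143, 157)})
print(result)
```

Precomputing the row sets should matter mostly when there are many item sets, since the conversion cost is then paid once per row rather than once per row per item set.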
Upvotes: 0
Reputation: 770
This function returns a dictionary that maps each tuple to the number of rows in the data frame that contain all of its values.
from collections import defaultdict

def count(df, sequence):
    dict_data = defaultdict(int)  # missing keys start at 0
    shape = df.shape[0]  # number of rows
    for items in sequence:
        for row in range(shape):
            # all(...) is True (counted as 1) only if every item is in this row
            dict_data[items] += all([item in df.iloc[row, :].values for item in items])
    return dict_data
You can pass the data frame and the set of tuples to the count() function, and it will return the number of rows that contain each tuple, i.e.
>>> count(data, itemsets)
defaultdict(<class 'int'>, {(39, 205): 2})
You can easily convert the result from a defaultdict to a plain dictionary by calling dict(), i.e.
>>> dict(count(data, itemsets))
{(39, 205): 2}
Both behave the same way for lookups of the counted tuples.
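If the row-by-row loop with iloc turns out to be slow on a larger data frame, one possible alternative is to let pandas do the membership test per item. This is a sketch, not a benchmarked replacement: the name count_with_isin is hypothetical, and it relies on DataFrame.isin and any(axis=1) to build a boolean mask of the rows containing each item, then combines the masks.

```python
import pandas as pd

def count_with_isin(df, itemsets):
    # Hypothetical helper, shown only as an illustration.
    counts = {}
    for items in itemsets:
        # Start with every row selected, then narrow down item by item.
        mask = pd.Series(True, index=df.index)
        for item in items:
            # True for the rows where the item appears in any column.
            mask &= df.isin([item]).any(axis=1)
        counts[items] = int(mask.sum())
    return counts

# Sample rows taken from the question.
data = pd.DataFrame([
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
])
counts = count_with_isin(data, {(39, 205), (175, 178)})
print(counts)
```

The result is a plain dictionary, so no conversion from defaultdict is needed.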
Upvotes: 1
Reputation: 1950
import pandas as pd

itemsets = {(39, 205), (39, 205, 401), (143, 157), (143, 166), (175, 178), (175, 190)}
x = [[39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
     [35, 39, 205, 712, 733, 759, 854, 950],
     [39, 422, 449, 704, 825, 857, 895, 937, 954, 964]]
data = pd.DataFrame(x)
for itemset in itemsets:
    print(itemset)
    count = 0
    for i in range(len(data)):
        flag = True
        for item in itemset:
            # value_counts() is indexed by the row's values, so "in" tests membership
            if item not in data.loc[i].value_counts():
                flag = False
        if flag:
            count += 1
    print(count)
Edited to handle item sets of arbitrary length, as suggested in the comments (many thanks for the useful insights).
Upvotes: 0