Reputation: 803
I have the following df:
I want to count how often each combination of elements occurs. For example:
and so on. In other words, I need to generate something like this:
That is, count the frequency of every single item and every combination of items, and keep only those with frequency >= n, where n is any positive integer. For this example, let's say n takes values in {1, 2, 3, 4}.
I've been trying to use the following code:
# candidate itemsets
records = []
num_records = len(data)  # assumption: `data` is the DataFrame shown above
# generate a list of lists of products that were bought together (convert df to list of lists)
for i in range(0, num_records):
    records.append([str(data.values[i, j]) for j in range(0, len(data.columns))])
# clean list (delete NaN values)
records = [[x for x in y if str(x) != 'nan'] for y in records]
OUTPUT:
[['detergent'],
['bread', 'water'],
['bread', 'umbrella', 'milk', 'diaper', 'beer'],
['detergent', 'beer', 'umbrella', 'milk'],
['cheese', 'detergent', 'diaper', 'umbrella'],
['umbrella', 'water', 'beer'],
['umbrella', 'water'],
['water', 'umbrella'],
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
['umbrella', 'cheese', 'detergent', 'water', 'beer']]
and then:
setOfItems = []
newListOfItems = []
for item in records:
    if item in setOfItems:
        continue
    setOfItems.append(item)
    temp = list(item)
    occurence = records.count(item)
    temp.append(occurence)
    newListOfItems.append(temp)
OUTPUT:
['detergent', 1]
['bread', 'water', 1]
['bread', 'umbrella', 'milk', 'diaper', 'beer', 1]
['detergent', 'beer', 'umbrella', 'milk', 1]
['cheese', 'detergent', 'diaper', 'umbrella', 1]
['umbrella', 'water', 'beer', 1]
['umbrella', 'water', 1]
['water', 'umbrella', 1]
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella', 1]
['umbrella', 'cheese', 'detergent', 'water', 'beer', 1]
As you can see, this only counts the frequency of each whole row (from image 1); my expected output is the one shown in the second image.
Upvotes: 4
Views: 2787
Reputation: 12808
Interesting problem! I am using itertools.combinations() to generate all possible combinations and collections.Counter() to count how often each combination appears:
import itertools
from collections import Counter

import numpy as np
import pandas as pd

# create sample data
df = pd.DataFrame([
    ['detergent', np.nan],
    ['bread', 'water', None],
    ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['umbrella', 'water']
])

def get_all_combinations_without_nan_or_None(row):
    # remove nan, None and duplicate values; sort so that the same items
    # always produce the same tuple, e.g. ('umbrella', 'water')
    items = sorted({value for value in row if isinstance(value, str)})
    # generate all possible combinations of the values in a row
    all_combinations = []
    for i in range(1, len(items) + 1):
        result = list(itertools.combinations(items, i))
        all_combinations.extend(result)
    return all_combinations

# get all possible combinations of values in a row
all_rows = df.apply(get_all_combinations_without_nan_or_None, axis=1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))
# use Counter to count how many there are of each combination
count_combinations = Counter(all_rows_flatten)
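The question also asks to keep only items and combinations with frequency >= n. One way to do that is to filter the counts afterwards. The following is a minimal sketch, not a definitive implementation: it works on a plain list of lists rather than a DataFrame, the sample rows are a hypothetical subset of the question's data, and `count_combinations` and `keep_frequent` are helper names made up for this illustration.

```python
import itertools
from collections import Counter

# Hypothetical sample: a few NaN-free rows in the style of the question's data.
records = [
    ['detergent'],
    ['bread', 'water'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['umbrella', 'water', 'beer'],
]

def count_combinations(rows):
    """Count every non-empty combination of distinct items in each row."""
    counts = Counter()
    for row in rows:
        # Sort the distinct items so each combination has one canonical key.
        items = sorted(set(row))
        for size in range(1, len(items) + 1):
            counts.update(itertools.combinations(items, size))
    return counts

def keep_frequent(counts, n):
    """Keep only the combinations seen at least n times."""
    return {combo: freq for combo, freq in counts.items() if freq >= n}

counts = count_combinations(records)
frequent = keep_frequent(counts, 2)
print(frequent)
```

Note that a row with k distinct items produces 2**k - 1 combinations, so this brute-force approach is only practical when rows are short.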
Docs on collections.Counter(): https://docs.python.org/2/library/collections.html#collections.Counter
Docs on itertools.combinations(): https://docs.python.org/2/library/itertools.html#itertools.combinations
Upvotes: 4