Reputation: 75
I have a CSV file (opened in Excel), for example:
Receipt  Name  Address      Date      Time   Items
25007    A     ABC pte ltd  4/7/2016  10:40  Cheese, Cookie, Pie
.
.
25008    B     CCC pte ltd  4/7/2016  12:40  Cheese, Cookie
What is a simple way to compare the values in the 'Items' column, find the most common combinations of items that people buy together, and display the top combinations? In this case the common pattern is Cheese, Cookie.
Upvotes: 2
Views: 1679
Reputation: 103874
Suppose after processing the CSV file you find the list of items from the CSV file to be:
>>> items=['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese',
... 'Cheese,Mousetrap,Pie', 'Cheese,Jam','Cheese','Cookie,Cheese,Mousetrap']
First determine all possible pairs:
>>> from itertools import combinations
>>> all_pairs={frozenset(t) for e in items for t in combinations(e.split(','),2)}
Then you can do:
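from collections import Counter
pair_counts=Counter()
for s in items:
    # each unordered pair is counted at most once per receipt
    for pair in {frozenset(t) for t in combinations(s.split(','), 2)}:
        # sort so the same pair always produces the same tuple key
        pair_counts.update({tuple(sorted(pair)):1})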
>>> pair_counts
Counter({('Cheese', 'Cookie'): 4, ('Cheese', 'Pie'): 3, ('Cookie', 'Pie'): 2, ('Cheese', 'Mousetrap'): 2, ('Cookie', 'Mousetrap'): 1, ('Cheese', 'Jam'): 1, ('Mousetrap', 'Pie'): 1, ('Cake', 'Cheese'): 1, ('Cake', 'Cookie'): 1})
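Since the question asks for only the top combinations, the pair counting above can be wrapped in a small helper that passes a limit to Counter.most_common. This is a minimal sketch, not the answer's own code: the function name top_pairs and the parameter k are made up for illustration, and stripping whitespace around item names is an assumption based on the spaces after the commas in the question's sample.

```python
from collections import Counter
from itertools import combinations

def top_pairs(items, k=3):
    """Return the k most frequent unordered item pairs (hypothetical helper)."""
    pair_counts = Counter()
    for s in items:
        # set() drops repeats within one receipt; sorted() gives a stable key order
        names = sorted(set(name.strip() for name in s.split(',')))
        # combinations of a sorted list yields sorted tuples, one per pair
        pair_counts.update(combinations(names, 2))
    return pair_counts.most_common(k)

items = ['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese',
         'Cheese,Mousetrap,Pie', 'Cheese,Jam', 'Cheese', 'Cookie,Cheese,Mousetrap']

for pair, count in top_pairs(items, k=2):
    print('{} + {}: {}'.format(pair[0], pair[1], count))
```

Pairs with equal counts come back in no guaranteed order, so a tie at the cut-off may go either way.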
Which can be extended to a more general case:
max_n=max(len(e.split(',')) for e in items)
for n in range(max_n, 1, -1):
    all_groups={frozenset(t) for e in items for t in combinations(e.split(','),n)}
    group_counts=Counter()
    for s in items:
        for group in {frozenset(t) for t in combinations(s.split(','), n)}:
            # sort so the same group always produces the same tuple key
            group_counts.update({tuple(sorted(group)):1})
    print('group length: {}, most_common: {}'.format(n, group_counts.most_common()))
Prints:
group length: 3, most_common: [(('Cheese', 'Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap', 'Pie'), 1), (('Cheese', 'Cookie', 'Mousetrap'), 1), (('Cake', 'Cheese', 'Cookie'), 1)]
group length: 2, most_common: [(('Cheese', 'Cookie'), 4), (('Cheese', 'Pie'), 3), (('Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap'), 2), (('Cookie', 'Mousetrap'), 1), (('Cheese', 'Jam'), 1), (('Mousetrap', 'Pie'), 1), (('Cake', 'Cheese'), 1), (('Cake', 'Cookie'), 1)]
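With real receipt data most groups occur only once, so it may help to keep only the groups seen at least a few times. The following is a rough sketch under that assumption; frequent_groups and min_count are hypothetical names, and the function assumes a non-empty list of comma-separated item strings.

```python
from collections import Counter
from itertools import combinations

def frequent_groups(items, min_count=2):
    """Map each group size n >= 2 to the groups seen at least min_count times."""
    # one sorted, de-duplicated list of item names per receipt
    baskets = [sorted(set(s.split(','))) for s in items]
    max_n = max(len(b) for b in baskets)
    result = {}
    for n in range(2, max_n + 1):
        counts = Counter(g for b in baskets for g in combinations(b, n))
        kept = [(g, c) for g, c in counts.most_common() if c >= min_count]
        if kept:
            result[n] = kept
    return result

items = ['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese',
         'Cheese,Mousetrap,Pie', 'Cheese,Jam', 'Cheese', 'Cookie,Cheese,Mousetrap']

for n, groups in sorted(frequent_groups(items).items(), reverse=True):
    print('group length: {}, frequent: {}'.format(n, groups))
```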
Upvotes: 2
Reputation: 180441
Presuming the Items field holds comma-separated values, you can build a frozenset from each row's items and use a Counter dict to count how often each exact set of items occurs:
from collections import Counter
import csv
with open("test.csv") as f:
    next(f)  # skip the header row
    counts = Counter(frozenset(row[-1].split(","))
                     for row in csv.reader(f))
print(counts.most_common())
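To try the whole-receipt count without a file on disk, a sketch along these lines may be useful. The in-memory sample stands in for test.csv and its layout is an assumption; it also strips the spaces that follow the commas in the question's data, which the snippet above does not do.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for test.csv; rows and layout are assumptions.
sample = io.StringIO(
    'Receipt,Name,Address,Date,Time,Items\n'
    '25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese, Cookie, Pie"\n'
    '25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese, Cookie"\n'
    '25009,B,CCC pte ltd,4/7/2016,12:45,"Cookie, Cheese"\n'
)

reader = csv.reader(sample)
next(reader)  # skip the header row
counts = Counter(
    # strip() removes the space after each comma so ' Cookie' matches 'Cookie'
    frozenset(item.strip() for item in row[-1].split(','))
    for row in reader
)
print(counts.most_common())
```

The second and third receipts list the same items in a different order and end up under one key.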
If you want all combinations or pairs as per your updated input:
from collections import Counter
from itertools import combinations
import csv

def combs(s):
    return combinations(s.split(","), 2)

with open("test.csv") as f:
    next(f)  # skip the header row
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                     for t in combs(row[-1]))
# counts -> Counter({frozenset(['Cheese', 'Cookie']): 2, frozenset(['Cheese', 'Pie']): 1, frozenset(['Cookie', 'Pie']): 1})
print(counts.most_common())
The order of the pairings is irrelevant, as frozenset([1,2]) and frozenset([2,1]) would be considered the same.
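A quick check of that property, as a small sketch using item names from the question: two frozensets built in different orders compare equal and hash equally, so a Counter files them under one key, whereas plain tuples would be kept apart.

```python
from collections import Counter

a = frozenset(['Cheese', 'Cookie'])
b = frozenset(['Cookie', 'Cheese'])

# equal contents means equal objects and equal hashes
same = (a == b) and (hash(a) == hash(b))

set_counts = Counter([a, b])
# tuples are order-sensitive, so these two stay separate keys
tuple_counts = Counter([('Cheese', 'Cookie'), ('Cookie', 'Cheese')])

print(same, len(set_counts), len(tuple_counts))
```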
If you want to consider all combinations of sizes 2 to n:
from collections import Counter
from itertools import chain, combinations
import csv

def combs(s):
    indiv_items = s.split(",")
    # every combination of 2, 3, ..., n items from this row
    return chain.from_iterable(combinations(indiv_items, i) for i in range(2, len(indiv_items) + 1))

with open("test.csv") as f:
    next(f)  # skip the header row
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                     for t in combs(row[-1]))
print(counts)
print(counts.most_common())
Which for:
Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese,Cookie,Pie"
25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese,Cookie"
25009,B,CCC pte ltd,4/7/2016,12:40,"Cookie,Cheese,pizza"
25010,B,CCC pte ltd,4/7/2016,12:40,"Pie,Cheese,pizza"
would give you:
Counter({frozenset(['Cheese', 'Cookie']): 3, frozenset(['Cheese', 'pizza']): 2, frozenset(['Cheese', 'Pie']): 2, frozenset(['Cookie', 'Pie']): 1, frozenset(['Cheese', 'Cookie', 'Pie']): 1, frozenset(['Cookie', 'pizza']): 1, frozenset(['Pie', 'pizza']): 1, frozenset(['Cheese', 'Cookie', 'pizza']): 1, frozenset(['Cheese', 'Pie', 'pizza']): 1})
[(frozenset(['Cheese', 'Cookie']), 3), (frozenset(['Cheese', 'pizza']), 2), (frozenset(['Cheese', 'Pie']), 2), (frozenset(['Cookie', 'Pie']), 1), (frozenset(['Cheese', 'Cookie', 'Pie']), 1), (frozenset(['Cookie', 'pizza']), 1), (frozenset(['Pie', 'pizza']), 1), (frozenset(['Cheese', 'Cookie', 'pizza']), 1), (frozenset(['Cheese', 'Pie', 'pizza']), 1)]
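One thing to keep in mind with the 2-to-n version: a receipt with n distinct items produces 2^n - n - 1 groups, which grows quickly for long receipts. A possible workaround, sketched here with a hypothetical capped_combs helper and max_size parameter that are not part of the code above, is to stop at a fixed group size.

```python
from itertools import chain, combinations

def capped_combs(s, max_size=3):
    """Yield item groups of size 2..max_size from a comma-separated string."""
    # de-duplicate and sort so each group has one canonical tuple form
    indiv_items = sorted(set(s.split(",")))
    upper = min(max_size, len(indiv_items))
    return chain.from_iterable(
        combinations(indiv_items, i) for i in range(2, upper + 1)
    )

row_items = "Cheese,Cookie,Pie,pizza"
pairs_only = list(capped_combs(row_items, max_size=2))
up_to_three = list(capped_combs(row_items, max_size=3))
everything = list(capped_combs(row_items, max_size=4))
print(len(pairs_only), len(up_to_three), len(everything))
```

It can be dropped in where combs is called in the Counter expression if the cap is acceptable for the analysis.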
Upvotes: 0