Finding patterns within a CSV file

Question

I have a CSV file exported from Excel, for example:

Receipt Name    Address      Date       Time    Items
25007   A      ABC pte ltd   4/7/2016   10:40   Cheese, Cookie, Pie
.
.
25008   B      CCC pte ltd   4/7/2016   12:40   Cheese, Cookie

What is a simple way to compare the 'Items' column, find the most common patterns of items that people buy together, and display the top combinations? In this case the common pattern is Cheese, Cookie.

Padraic Cunningham · Accepted Answer

Presuming the Items field holds comma-separated values, you can build a frozenset from each row's items and use a Counter to count how often each exact set of items occurs:

from collections import Counter
import csv

with open("test.csv") as f:
    next(f)  # skip the header row
    # Key each row by the set of its items, so item order does not matter.
    counts = Counter(frozenset(row[-1].split(","))
                     for row in csv.reader(f))
    print(counts.most_common())
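
Note that the Items column in the question has a space after each comma ("Cheese, Cookie, Pie"), so a plain split(",") would treat " Cookie" and "Cookie" as different items. A minimal sketch of a helper that normalizes this follows; split_items is a hypothetical name used only for illustration, not part of the answer:

```python
def split_items(cell):
    """Split an Items cell on commas, dropping surrounding whitespace and empty entries."""
    return [item.strip() for item in cell.split(",") if item.strip()]


# Both spellings of the same basket now produce the same set of items.
with_spaces = frozenset(split_items("Cheese, Cookie, Pie"))
without_spaces = frozenset(split_items("Cheese,Cookie,Pie"))
print(with_spaces == without_spaces)  # True
```

Under that assumption, row[-1].split(",") in the snippets here could be swapped for split_items(row[-1]).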

If you want to count every pair of items that appears together, as per your updated input:

from collections import Counter
from itertools import combinations
import csv

def combs(s):
    # every unordered pair of items in one Items cell
    return combinations(s.split(","), 2)

with open("test.csv") as f:
    next(f)  # skip the header row
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                     for t in combs(row[-1]))
    # counts -> Counter({frozenset(['Cheese', 'Cookie']): 2, frozenset(['Cheese', 'Pie']): 1, frozenset(['Cookie', 'Pie']): 1})
    print(counts.most_common())

The order of the items within a pairing is irrelevant, as frozenset([1, 2]) and frozenset([2, 1]) compare equal and hash the same, so both are counted under a single key.
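
As a small illustration of why frozenset is used as the key rather than a tuple, the following sketch counts the same pair written in two orders. The variable names are chosen here for illustration only:

```python
from collections import Counter

a = frozenset(["Cheese", "Cookie"])
b = frozenset(["Cookie", "Cheese"])

# Equal sets hash the same, so a Counter merges them into one key.
print(a == b)  # True
set_counts = Counter([a, b])
print(set_counts[a])  # 2

# Tuples keep their order, so the same pair would be split across two keys.
tuple_counts = Counter([("Cheese", "Cookie"), ("Cookie", "Cheese")])
print(len(tuple_counts))  # 2
```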

If you want to consider all combinations of size 2 up to n, where n is the number of items on a receipt:

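from collections import Counter
from itertools import chain, combinations
import csv

def combs(s):
    indiv_items = s.split(",")
    # combinations of every size from 2 up to the number of items
    return chain.from_iterable(combinations(indiv_items, i)
                               for i in range(2, len(indiv_items) + 1))

with open("test.csv") as f:
    next(f)  # skip the header row
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                     for t in combs(row[-1]))
    print(counts)
    print(counts.most_common())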

Which, for this input:

Receipt,Name,Address,Date,Time,Items
25007,A,ABC,pte,ltd,4/7/2016,10:40,"Cheese,Cookie,Pie"
25008,B,CCC,pte,ltd,4/7/2016,12:40,"Cheese,Cookie"
25009,B,CCC,pte,ltd,4/7/2016,12:40,"Cookie,Cheese,pizza"
25010,B,CCC,pte,ltd,4/7/2016,12:40,"Pie,Cheese,pizza"

would give you:

Counter({frozenset(['Cheese', 'Cookie']): 3, frozenset(['Cheese', 'pizza']): 2, frozenset(['Cheese', 'Pie']): 2, frozenset(['Cookie', 'Pie']): 1, frozenset(['Cheese', 'Cookie', 'Pie']): 1, frozenset(['Cookie', 'pizza']): 1, frozenset(['Pie', 'pizza']): 1, frozenset(['Cheese', 'Cookie', 'pizza']): 1, frozenset(['Cheese', 'Pie', 'pizza']): 1})
[(frozenset(['Cheese', 'Cookie']), 3), (frozenset(['Cheese', 'pizza']), 2), (frozenset(['Cheese', 'Pie']), 2), (frozenset(['Cookie', 'Pie']), 1), (frozenset(['Cheese', 'Cookie', 'Pie']), 1), (frozenset(['Cookie', 'pizza']), 1), (frozenset(['Pie', 'pizza']), 1), (frozenset(['Cheese', 'Cookie', 'pizza']), 1), (frozenset(['Cheese', 'Pie', 'pizza']), 1)]
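
To display only the top few combinations, as the question asks, most_common accepts a count. The following is a minimal sketch assuming Python 3; SAMPLE and top_combinations are names invented for illustration, and the sample rows are embedded as a string so the snippet runs without a test.csv file:

```python
import csv
import io
from collections import Counter
from itertools import chain, combinations

# The sample rows from this answer, embedded so no file is needed.
SAMPLE = '''Receipt,Name,Address,Date,Time,Items
25007,A,ABC,pte,ltd,4/7/2016,10:40,"Cheese,Cookie,Pie"
25008,B,CCC,pte,ltd,4/7/2016,12:40,"Cheese,Cookie"
25009,B,CCC,pte,ltd,4/7/2016,12:40,"Cookie,Cheese,pizza"
25010,B,CCC,pte,ltd,4/7/2016,12:40,"Pie,Cheese,pizza"
'''


def combs(s):
    """Yield every combination of 2 or more items from one Items cell."""
    items = s.split(",")
    return chain.from_iterable(combinations(items, i)
                               for i in range(2, len(items) + 1))


def top_combinations(lines, n=3):
    """Return the n most frequent item combinations as (sorted items, count) pairs."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    counts = Counter(frozenset(t)
                     for row in reader if row  # ignore blank lines
                     for t in combs(row[-1]))
    return [(sorted(combo), count) for combo, count in counts.most_common(n)]


top = top_combinations(io.StringIO(SAMPLE), n=3)
for combo, count in top:
    print("{}: {}".format(", ".join(combo), count))
# The first line printed is "Cheese, Cookie: 3".
```

Sorting each combination before printing only makes the display stable; the counting itself still relies on frozenset keys.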
