Darryl Dan
Darryl Dan

Reputation: 75

finding pattern within csv file

I have a CSV Excel file example:

Receipt Name    Address      Date       Time    Items
25007   A      ABC pte ltd   4/7/2016   10:40   Cheese, Cookie, Pie
.
.
25008   B      CCC pte ltd   4/7/2016   12:40   Cheese, Cookie

What is a simple way to compare the 'Items' column and find out the most common pattern of the items people buy together and display the top combinations? In this case the similar pattern is Cheese, Cookie.

Upvotes: 2

Views: 1679

Answers (2)

dawg
dawg

Reputation: 103874

Suppose after processing the CSV file you find the list of items from the CSV file to be:

>>> items=['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese', 
... 'Cheese,Mousetrap,Pie', 'Cheese,Jam','Cheese','Cookie,Cheese,Mousetrap']

First determine all possible pairs:

>>> from itertools import combinations
>>> all_pairs={frozenset(t) for e in items for t in combinations(e.split(','),2)}

Then you can do:

from collections import Counter
pair_counts=Counter()
for s in items:
    for pair in {frozenset(t) for t in combinations(s.split(','), 2)}:
        pair_counts.update({tuple(pair):1})

>>> pair_counts
Counter({('Cheese', 'Cookie'): 4, ('Cheese', 'Pie'): 3, ('Cookie', 'Pie'): 2, ('Cheese', 'Mousetrap'): 2, ('Cookie', 'Mousetrap'): 1, ('Cheese', 'Jam'): 1, ('Mousetrap', 'Pie'): 1, ('Cake', 'Cheese'): 1, ('Cake', 'Cookie'): 1})

Which can be extended to a more general case:

max_n=max(len(e.split(',')) for e in items)
for n in range(max_n, 1, -1):
    all_groups={frozenset(t) for e in items for t in combinations(e.split(','),n)}
    group_counts=Counter()
    for s in items:
        for group in {frozenset(t) for t in combinations(s.split(','), n)}:
            group_counts.update({tuple(group):1})      
    print 'group length: {}, most_common: {}'.format(n, group_counts.most_common())     

Prints:

group length: 3, most_common: [(('Cheese', 'Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap', 'Pie'), 1), (('Cheese', 'Cookie', 'Mousetrap'), 1), (('Cake', 'Cheese', 'Cookie'), 1)]
group length: 2, most_common: [(('Cheese', 'Cookie'), 4), (('Cheese', 'Pie'), 3), (('Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap'), 2), (('Cookie', 'Mousetrap'), 1), (('Cheese', 'Jam'), 1), (('Mousetrap', 'Pie'), 1), (('Cake', 'Cheese'), 1), (('Cake', 'Cookie'), 1)]

Upvotes: 2

Padraic Cunningham
Padraic Cunningham

Reputation: 180441

Presuming you have comma separated values, you can use a frozenset of the pairings and use a Counter dict to get the counts:

from collections import Counter
import csv

with open("test.csv") as f:
    next(f)
    counts = Counter(frozenset(tuple(row[-1].split(",")))
                     for row in csv.reader(f))
    print(counts.most_common())

If you want all combinations or pairs as per your updated input:

from collections import Counter
from itertools import combinations

def combs(s):
    return  combinations(s.split(","), 2)

import csv
with open("test.csv") as f:
    next(f)
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                            for t in combs(row[-1]))
    # counts -> Counter({frozenset(['Cheese', 'Cookie']): 2, frozenset(['Cheese', 'Pie']): 1, frozenset(['Cookie', 'Pie']): 1})
    print(counts.most_common())

The order of the pairings is irrelevant as frozenset([1,2]) and frozenset([2,1]) would be considered the same.

If you want to consider all combinations from 2-n:

def combs(s):
    indiv_items = s.split(",")
    return chain.from_iterable(combinations(indiv_items, i) for i in range(2, len(indiv_items) + 1))


import csv

with open("test.csv") as f:
    next(f)
    counts = Counter(frozenset(t)
                     for row in csv.reader(f)
                         for t in combs(row[-1]))
    print(counts)
    print(counts.most_common())

Which for:

Receipt,Name,Address,Date,Time,Items
25007,A,ABC,pte,ltd,4/7/2016,10:40,"Cheese,Cookie,Pie"
25008,B,CCC,pte,ltd,4/7/2016,12:40,"Cheese,Cookie"
25009,B,CCC,pte,ltd,4/7/2016,12:40,"Cookie,Cheese,pizza"
25010,B,CCC,pte,ltd,4/7/2016,12:40,"Pie,Cheese,pizza"

would give you:

Counter({frozenset(['Cheese', 'Cookie']): 3, frozenset(['Cheese', 'pizza']): 2, frozenset(['Cheese', 'Pie']): 2, frozenset(['Cookie', 'Pie']): 1, frozenset(['Cheese', 'Cookie', 'Pie']): 1, frozenset(['Cookie', 'pizza']): 1, frozenset(['Pie', 'pizza']): 1, frozenset(['Cheese', 'Cookie', 'pizza']): 1, frozenset(['Cheese', 'Pie', 'pizza']): 1})
[(frozenset(['Cheese', 'Cookie']), 3), (frozenset(['Cheese', 'pizza']), 2), (frozenset(['Cheese', 'Pie']), 2), (frozenset(['Cookie', 'Pie']), 1), (frozenset(['Cheese', 'Cookie', 'Pie']), 1), (frozenset(['Cookie', 'pizza']), 1), (frozenset(['Pie', 'pizza']), 1), (frozenset(['Cheese', 'Cookie', 'pizza']), 1), (frozenset(['Cheese', 'Pie', 'pizza']), 1)]

Upvotes: 0

Related Questions