Grouping a list of sublists by some of the values in each sublist

Question

I have a dictionary containing lists. For example,

{1: [[sender11, receiver11, text11, address11]], 
 2: [[sender21, receiver21, text21, address21], [sender22, receiver22, text22, address22]], 
 3: [[sender31, receiver31, text31, address31], [sender32, receiver32, text32, address32], [sender33, receiver33, text33, address33]],
 4: [[sender41, receiver41, text41, address41], [sender42, receiver42, text42, address42], [sender43, receiver43, text43, address43], [sender44, receiver44, text44, address44]]}

For each dictionary value that contains 2 or more sublists (i.e. dict[2], dict[3] and dict[4] in this example), I want to compare the sender, receiver and text of each sublist. For each group of sublists that share the same sender, receiver and text, I'll do something.

So for example, in dict[3], if sender31, receiver31, text31 are the same values as sender32, receiver32, text32 and sender33, receiver33, text33, then I'll do something with all 3 list values.

Say in dict[4], if sender41, receiver41, text41 are the same values as sender42, receiver42, text42, while sender43, receiver43, text43 are the same values as sender44, receiver44, text44, but different from sender41, receiver41, text41, then I'll work on these 2 groups separately.

I wrote a Python script that pretty much brute force compares the values of sender21, receiver21, text21 and sender22, receiver22, text22, i.e.

if sender21 == sender22 and receiver21 == receiver22 and text21 == text22:
   # Do something

This doesn't generalize: it only works for 2 list values, and I don't know how to implement it so that it works for any number of list values greater than 1.

schesis · Accepted Answer

I think a defaultdict is the obvious way to go here:

from collections import defaultdict

def collate(seq):
    groups = defaultdict(list)
    for subseq in seq:
        groups[tuple(subseq[:3])].append(subseq[3])
    return groups

Depending on your actual data, you might replace tuple(subseq[:3]) in the function above with e.g. (subseq[1], subseq[4], subseq[5]), or the appended subseq[3] with subseq itself ... that'll depend on what you're doing with the data.

The key has to be a tuple rather than a list, though, because dictionary keys must be hashable, and lists are not.
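A quick way to see that restriction on keys in action is the minimal sketch below. It is written for Python 3, and the names groups and list_key_error are only illustrative:

```python
groups = {}

# A tuple of strings works as a dictionary key.
groups[("S1", "R1", "T1")] = ["A3"]

# A list does not: using one as a key raises TypeError.
try:
    groups[["S1", "R1", "T1"]] = ["A3"]
    list_key_error = None
except TypeError as exc:
    list_key_error = exc

print(type(list_key_error).__name__)
```

This is why collate wraps the slice in tuple(...) before using it as a key.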

Example:

>>> data = [
...     ['S1', 'R1', 'T1', 'A3'],
...     ['S2', 'R2', 'T2', 'A4'],
...     ['S1', 'R1', 'T1', 'A5'],
...     ['S2', 'R2', 'T2', 'A6']
... ]

>>> collate(data)
defaultdict(<type 'list'>, {
    ('S2', 'R2', 'T2'): ['A4', 'A6'],
    ('S1', 'R1', 'T1'): ['A3', 'A5']
})

You can work with this just as you would any other dictionary, e.g.

>>> for (sender, receiver, text), addresses in collate(data).items():
...     print sender, receiver, text
...     print '|'.join(addresses)
...     print
... 
S2 R2 T2
A4|A6

S1 R1 T1
A3|A5
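To tie this back to the dictionary in the question, here is a rough sketch of one way to apply the same grouping to every value that holds 2 or more sublists. It is written for Python 3 and makes several assumptions: collate_rows is a hypothetical variant of collate that keeps the whole sublist instead of only the address, key_len is an illustrative parameter, and the messages dictionary is made-up sample data standing in for the real one.

```python
from collections import defaultdict


def collate_rows(seq, key_len=3):
    """Group whole sublists by the tuple of their first key_len values."""
    groups = defaultdict(list)
    for subseq in seq:
        groups[tuple(subseq[:key_len])].append(subseq)
    return groups


# Made-up sample data shaped like the dictionary in the question.
messages = {
    1: [["S1", "R1", "T1", "A1"]],
    3: [["S1", "R1", "T1", "A2"], ["S1", "R1", "T1", "A3"],
        ["S1", "R1", "T1", "A4"]],
    4: [["S1", "R1", "T1", "A5"], ["S1", "R1", "T1", "A6"],
        ["S2", "R2", "T2", "A7"], ["S2", "R2", "T2", "A8"]],
}

# Collect (dictionary key, group key, matching sublists) for every group,
# skipping dictionary values that hold fewer than 2 sublists.
results = []
for msg_id, rows in messages.items():
    if len(rows) < 2:
        continue
    for group_key, members in collate_rows(rows).items():
        results.append((msg_id, group_key, members))

# "Do something" with each group; here we just print its addresses.
for msg_id, group_key, members in results:
    print(msg_id, group_key, [row[3] for row in members])
```

With this sample data, the value under key 3 yields a single group of three sublists, while the value under key 4 yields two separate groups of two, which matches the two cases described in the question.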
