Reputation: 10709
I have a large number of two-element sub-lists inside a list called mylist:
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 34439],
          ['XC013', 86701]]
I want to group the sub-lists of mylist into new lists according to their first item, so that sub-lists sharing the same string end up together. For example, this is what I am looking for my code to output:
newlist = [[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
           [['XC013', 99834], ['XC013', 86701]],
           [['VD126', 18884]]]
Here is how I was trying to code this:
mylist = sorted(mylist)
newlist = []
for sublist in mylist:
    id = sublist[0]
    if id == next.id:
        newlist.append(id)
print newlist
I was also trying to understand whether itertools.groupby() is the correct tool here. Can someone help me with this problem?
Upvotes: 4
Views: 454
Reputation: 62513
Use .get to determine whether a key exists; it returns some specified value, None in this case, if the key is nonexistent. dict.get defaults to None, so this method never raises a KeyError. If None is a legitimate value in the dictionary, change the default value returned by .get, e.g. test.get(t[0], 'something here'). Create the key, and then add the list t as the dict value.

test = dict()
for t in mylist:
    if test.get(t[0]) is None:
        test[t[0]] = [t]
    else:
        test[t[0]].append(t)
final = list(test.values())
print(final) results in:
[[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
[['XC013', 99834], ['XC013', 86701]],
[['VD126', 18884]]]
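As a side illustration of the .get semantics this relies on (a minimal sketch; the test dict mirrors the one built above):

test = {'AB001': [['AB001', 22100]]}
print(test.get('AB001'))         # [['AB001', 22100]] -- key exists
print(test.get('XC013'))         # None -- missing key, implicit default
print(test.get('XC013', 'n/a'))  # 'n/a' -- missing key, explicit default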
Upvotes: 0
Reputation: 26916
There are a number of alternatives to solve this problem:
def regroup_by_di(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        if key_value not in result:
            result[key_value] = []
        result[key_value].append(item)
    return result

import collections

def regroup_by_dd(items, key=None):
    result = collections.defaultdict(list)
    callable_key = callable(key)
    for item in items:
        result[key(item) if callable_key else item].append(item)
    return dict(result)  # to be in line with other solutions

def regroup_by_sd(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        result.setdefault(key_value, []).append(item)
    return result

import itertools

def regroup_by_it(items, key=None):
    seq = sorted(items, key=key)
    result = {
        key_value: list(group)
        for key_value, group in itertools.groupby(seq, key)}
    return result

def group_by(seq, key=None):
    items = iter(seq)
    try:
        item = next(items)
    except StopIteration:
        return
    else:
        callable_key = callable(key)
        last = key(item) if callable_key else item
        i = j = 0
        for i, item in enumerate(items, 1):
            current = key(item) if callable_key else item
            if last != current:
                yield last, seq[j:i]
                last = current
                j = i
        if i >= j:
            yield last, seq[j:i + 1]

def regroup_by_gb(items, key=None):
    return dict(group_by(sorted(items, key=key), key))
These can be divided into two categories:

- those based on a dict-like structure (regroup_by_di(), regroup_by_dd(), regroup_by_sd())
- those based on a uniq-like function, e.g. itertools.groupby() (regroup_by_it(), regroup_by_gb())

The first class of approaches has O(n) computational complexity, while the second class has O(n log n), driven by the initial sorting step.
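As a quick sanity check (a minimal sketch, assuming the five functions above are in scope), all variants should produce the same grouping for the same input and key:

from operator import itemgetter

data = [['AB001', 1], ['XC013', 2], ['AB001', 3]]
variants = [regroup_by_di, regroup_by_dd, regroup_by_sd,
            regroup_by_it, regroup_by_gb]
results = [f(data, key=itemgetter(0)) for f in variants]
assert all(r == results[0] for r in results)
print(results[0])
# {'AB001': [['AB001', 1], ['AB001', 3]], 'XC013': [['XC013', 2]]}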
All of the proposed approaches require specifying a key. For OP's problem, operator.itemgetter(0) or lambda x: x[0] would work. Additionally, to get OP's desired result one should take only list(dict.values()), e.g.:
from operator import itemgetter

mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
# [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]
The timings come out faster for all dict-based (first class) solutions and slower for all groupby-based (second class) solutions.
Within the dict-based solutions, performance depends slightly on the "collision rate", i.e. how often an incoming item maps to a key that already exists rather than creating a new one.
For higher collision rates regroup_by_di() may be the fastest, while for lower collision rates regroup_by_dd() may be the fastest.
The benchmarks come out as follows:
(More details available here.)
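A minimal timing sketch along those lines (hypothetical workload; it assumes the functions above are defined, and absolute numbers will vary by machine and data):

import random
import timeit
from operator import itemgetter

# high collision rate: many items, few distinct keys
data = [[random.choice(['AB001', 'XC013', 'VD126']), i] for i in range(10000)]

for f in (regroup_by_di, regroup_by_dd, regroup_by_sd,
          regroup_by_it, regroup_by_gb):
    t = timeit.timeit(lambda: f(data, key=itemgetter(0)), number=100)
    print(f'{f.__name__}: {t:.3f}s')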
Upvotes: 2
Reputation: 176940
You were right about this being a job for groupby:
from itertools import groupby
from operator import itemgetter
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
print([list(value) for key, value in groupby(sorted(mylist), key=itemgetter(0))])
This will give you a list-of-lists, grouped by the first item in the sublist.
[[['AB001', 4439], ['AB001', 22100], ['AB001', 32935]],
[['VD126', 18884]],
[['XC013', 86701], ['XC013', 99834]]]
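Note that groupby only merges runs of consecutive equal keys, which is why the sorted() call matters. A minimal sketch of what happens without it:

from itertools import groupby
from operator import itemgetter

unsorted = [['AB001', 1], ['XC013', 2], ['AB001', 3]]
print([list(g) for _, g in groupby(unsorted, key=itemgetter(0))])
# [[['AB001', 1]], [['XC013', 2]], [['AB001', 3]]] -- 'AB001' is split in two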
Upvotes: 6
Reputation: 164783
collections.defaultdict

An itertools.groupby solution will incur O(n log n) cost since the input must be sorted first. You can use a defaultdict of lists for a guaranteed O(n) solution:
from collections import defaultdict

dd = defaultdict(list)
for item in mylist:
    dd[item[0]].append(item)

res = list(dd.values())
print(res)
[[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
[['XC013', 99834], ['XC013', 86701]],
[['VD126', 18884]]]
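If the groups should also come out ordered by key (as in the groupby answer), sorting just the k distinct keys afterwards costs O(k log k), which stays cheaper than sorting all n items when keys repeat often (a sketch, reusing dd from above):

res_sorted = [dd[k] for k in sorted(dd)]  # groups in ascending key order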
Upvotes: 3