Lucien S.
Lucien S.

Reputation: 5345

Sorting and filtering a list

I have list like this:

[['Richard', 1, 'Group A'], ['Mark', 3, 'Group A'],
 ['Alan', 4, 'Group B'], ['Dave', 3, 'Group B'],
 ['Gordon', 2, 'Group A']]

That I would like to filter so that only the lowest number (Richard has a number of 1, Mark is 3, Alan is 4, etc.) of each group is kept so that the list would look like:

[['Richard', 1, 'Group A'], ['Dave', 3, 'Group B']]

I'm sorting with the lambda key:

filteredList = sorted(list,key=lambda x: x[2])

But I'm blocked when it comes to sorting within each group and getting rid of higher ranking individuals.

Is there a simple way to achieve this in Python or should I iterate and test each row?

Upvotes: 2

Views: 185

Answers (5)

mgilson
mgilson

Reputation: 310049

This is a simple "bin and find min" problem. The first pass, we'll bin:

from collections import defaultdict
bins = defaultdict(list)
for item in input_list:
    bins[item[2]].append(item)

Now we just need to take the minimum of each of the bins:

from operator import itemgetter
get_second = itemgetter(1)
results = [min(group, key=get_second) for group in bins.values()]

Up to this point, we have an O(N) algorithm (binning happens in O(1) time for each of the N items we put in the dict) and finding the min runs over each of the items exactly once more -- so that's O(N) too...


If necessary, you can then sort the results by group name:

results.sort(key=itemgetter(2))

We can do the min step and binning step at the same time to save a bit of memory (e.g. if the input is coming from a generator and has lots of items):

from operator import itemgetter
get_second = itemgetter(1)
results = {}
for item in input_stream:
    group = item[2]
    if group not in results:
        results[group] = item
    else:
        results[group] = min(item, results[group], key=get_second)

This is effectively a different implementation of the same idea as the solution provided by @wim. To order the results when you're done (if necessary):

 ordered_results = sorted(results.values(), key=itmegetter(2))

In this way, we only keep one result for each group around. The cost is a bit of extra code complexity.

Upvotes: 3

Kasravnd
Kasravnd

Reputation: 107337

You can use collections.defaultdict in order to group your sub lists based on 3rd item then use min() function with a proper key within a list comprehension in order to get the expected result:

In [61]: from operator import itemgetter
In [62]: from collections import defaultdict
In [63]: lst = [['Richard', 1, 'Group A'], ['Mark', 3, 'Group A'], ['Alan', 4, 'Group B'], ['Dave', 3, 'Group B'], ['Gordon', 2, 'Group A']]

In [64]: d = defaultdict(list)

In [65]: for i, j, k in lst:
             d[k].append([i, j, k])
   ....:     

In [66]: [min(sub, key=itemgetter(1)) for sub in d.values()]
Out[66]: [['Dave', 3, 'Group B'], ['Richard', 1, 'Group A']]

You can do this even in a more optimized manner by passing a customized object to defaultdict(), so that it only appends new items if they have a smaller second item:

from collections import defaultdict


class MyList(list):

    def __init__(self, *args, **kwargs):
        super(MyList, self).__init__(*args, **kwargs)

    def special_append(self, arg):
        if not self:
            self.append(arg)
        elif arg[1] < self[0][1]:
            self[0] = arg

Demo:

lst = [['Richard', 1, 'Group A'], ['Mark', 3, 'Group A'], ['Alan', 4, 'Group B'], ['Dave', 3, 'Group B'], ['Gordon', 2, 'Group A']]

d = defaultdict(MyList)

for i, j, k in lst:
    d[k].special_append([i, j, k])

print(d)

defaultdict(<class '__main__.MyList'>, {'Group B': [['Dave', 3, 'Group B']], 'Group A': [['Richard', 1, 'Group A']]})

Upvotes: 3

wim
wim

Reputation: 363053

Re-key the data off of the group name. Don't name your data list because it shadows a built-in name.

>>> results = {}
>>> for name, number, group in data:
...     key = group
...     value = number, name
...     results[key] = min(value, results.get(key, value))
...     
>>> [[name, number, group] for group, (number, name) in results.items()]
[['Dave', 3, 'Group B'], ['Richard', 1, 'Group A']]

Pure python data structures handle this problem quite nicely, the sort/itertools approach is suboptimal and increases complexity from O(n) to O(n logn).

Upvotes: 4

Adam Smith
Adam Smith

Reputation: 54223

I agree with TemporalWolf in the comments that itertools.groupby is the right approach to take.

from itertools import groupby
from operator import itemgetter

in_ = [['Richard', 1, 'Group A'], ['Mark', 3, 'Group A'],
       ['Alan', 4, 'Group B'], ['Dave', 3, 'Group B'],
       ['Gordon', 2, 'Group A']]

groups = groupby(in_, key=itemgetter(2))
# operator.itemgetter(N) is equivalent to lambda x: x[N]

The groupby function creates something akin to:

[("Group A", [['Richard', 1, 'Group A'],
              ['Mark', 3, 'Group A'],
              ['Gordon', 2, 'Group A']]),
 ("Group B", [['Alan', 4, 'Group B'],
              ['Dave', 3, 'Group B']])]

Then it's easy enough to iterate and use min to find the results

minimums = []
for _, vals in groups:
    minimums.append(min(vals, key=itemgetter(1)))

Upvotes: 0

blue note
blue note

Reputation: 29089

You could make it something like key=lambda x: (x[2], x[1]). Then you have two-level sorting.

Alternatively, operator.itemgetter can take multiple indices.

Upvotes: 0

Related Questions