Reputation: 43246
I have some data stored in a list that I would like to group based on a value.
For example, if my data is
data = [(1, 'a'), (2, 'x'), (1, 'b')]
and I want to group it by the first value in each tuple to get
result = [(1, 'ab'), (2, 'x')]
how would I go about it?
More generally, what's the recommended way to group data in Python? Is there a recipe that can help me?
Upvotes: 6
Views: 610
Reputation: 164773
This is inefficient compared to the dict and groupby solutions.
However, for small lists where performance is not a concern, you can perform a list comprehension which parses the list for each unique identifier.
res = [(i, ''.join([j[1] for j in data if j[0] == i]))
       for i in set(list(zip(*data))[0])]
[(1, 'ab'), (2, 'x')]
The solution can be split into 2 parts:
set(list(zip(*data))[0]) extracts the unique set of identifiers, which we iterate over via a for loop within the list comprehension.
(i, ''.join([j[1] for j in data if j[0] == i])) applies the logic we require for the desired output.
Upvotes: -3
Reputation: 164773
This isn't a recipe as such, but an intuitive and flexible way to group data using a function. In this case, the function is str.join.
import pandas as pd
data = [(1, 'a'), (2, 'x'), (1, 'b')]
# create dataframe from list of tuples
df = pd.DataFrame(data)
# group by first item and apply str.join
grp = df.groupby(0)[1].apply(''.join)
# create list of tuples from index and value
res = list(zip(grp.index, grp))
print(res)
[(1, 'ab'), (2, 'x')]
Advantages
list output at the end of a sequence of vectorisable steps.
''.join to list or other reducing function.
Disadvantages
list -> pd.DataFrame -> list conversion.
Upvotes: 0
Reputation: 43246
The go-to data structure to use for all kinds of grouping is the dict. The idea is to use something that uniquely identifies a group as the dict's keys, and store all values that belong to the same group under the same key.
As an example, your data could be stored in a dict like this:
{1: ['a', 'b'],
2: ['x']}
The integer that you're using to group the values is used as the dict key, and the values are aggregated in a list.
We use a dict because it maps keys to values in constant time on average, O(1). This makes the grouping process very efficient and also very easy. The general structure of the code will always be the same for all kinds of grouping tasks: you iterate over your data and gradually fill a dict with grouped values. Using a defaultdict instead of a regular dict makes the whole process even easier, because we don't have to worry about initializing the dict with empty lists.
import collections
groupdict = collections.defaultdict(list)
for value in data:
    group = value[0]
    value = value[1]
    groupdict[group].append(value)
# result:
# {1: ['a', 'b'],
# 2: ['x']}
Once the data is grouped, all that's left is to convert the dict to your desired output format:
result = [(key, ''.join(values)) for key, values in groupdict.items()]
# result: [(1, 'ab'), (2, 'x')]
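For comparison, the same grouping can be sketched with a plain dict and dict.setdefault, which does roughly what defaultdict(list) does for us automatically. This is only an illustration using the data from the question, not a replacement for the recipe:

```python
data = [(1, 'a'), (2, 'x'), (1, 'b')]

groupdict = {}
for group, value in data:
    # setdefault returns the list already stored for this group,
    # or stores and returns a new empty list if the group is new
    groupdict.setdefault(group, []).append(value)

result = [(key, ''.join(values)) for key, values in groupdict.items()]
print(result)  # [(1, 'ab'), (2, 'x')]
```

The defaultdict version is usually easier to read, which is why the rest of this answer sticks with it.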
The following section will provide recipes for different kinds of inputs and outputs, and show how to group by various things. The basis for everything is the following snippet:
import collections
groupdict = collections.defaultdict(list)
for value in data: # input
    group = ??? # group identifier
    value = ??? # value to add to the group
    groupdict[group].append(value)
result = groupdict # output
Each of the commented lines can/has to be customized depending on your use case.
The format of your input data dictates how you iterate over it.
In this section, we're customizing the for value in data: line of the recipe.
More often than not, all the values are stored in a flat list:
data = [value1, value2, value3, ...]
In this case we simply iterate over the list with a for loop:
for value in data:
If you have multiple lists with each list holding the value of a different attribute like
firstnames = [firstname1, firstname2, ...]
middlenames = [middlename1, middlename2, ...]
lastnames = [lastname1, lastname2, ...]
use the zip function to iterate over all lists simultaneously:
for value in zip(firstnames, middlenames, lastnames):
This will make value a tuple of (firstname, middlename, lastname).
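As a small sketch of how the zip variant plugs into the recipe, here is a grouping by last name. The names in the three lists are made-up sample data:

```python
import collections

# made-up sample data, one list per attribute
firstnames = ['John', 'Jane', 'Bob']
middlenames = ['A.', 'B.', 'C.']
lastnames = ['Doe', 'Doe', 'Smith']

groupdict = collections.defaultdict(list)
for value in zip(firstnames, middlenames, lastnames):
    group = value[2]  # group by last name
    groupdict[group].append(value)

print(dict(groupdict))
# {'Doe': [('John', 'A.', 'Doe'), ('Jane', 'B.', 'Doe')], 'Smith': [('Bob', 'C.', 'Smith')]}
```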
If you want to combine multiple dicts like
dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 5}
first put them all in a list:
dicts = [dict1, dict2]
And then use two nested loops to iterate over all (key, value) pairs:
for dict_ in dicts:
    for value in dict_.items():
In this case, the value variable will take the form of a 2-element tuple like ('a', 1) or ('b', 2).
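Put together with the rest of the recipe, combining the two dicts could look like the following sketch, which groups by dict key and keeps only the dict value:

```python
import collections

dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 5}
dicts = [dict1, dict2]

groupdict = collections.defaultdict(list)
for dict_ in dicts:
    for value in dict_.items():
        group = value[0]  # the dict key
        value = value[1]  # the dict value
        groupdict[group].append(value)

print(dict(groupdict))  # {'a': [1], 'b': [2, 5]}
```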
Here we'll cover various ways to extract group identifiers from your data.
In this section, we're customizing the group = ??? line of the recipe.
If your values are lists or tuples like (attr1, attr2, attr3, ...) and you want to group them by the nth element:
group = value[n]
The syntax is the same for dicts, so if you have values like {'firstname': 'foo', 'lastname': 'bar'} and you want to group by the first name:
group = value['firstname']
If your values are objects like datetime.date(2018, 5, 27) and you want to group them by an attribute, like year:
group = value.year
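For instance, grouping a handful of dates by year might look like this (the dates are arbitrary sample values):

```python
import collections
import datetime

# arbitrary sample dates
dates = [
    datetime.date(2018, 5, 27),
    datetime.date(2017, 1, 1),
    datetime.date(2018, 12, 31),
]

groupdict = collections.defaultdict(list)
for value in dates:
    group = value.year  # group by the year attribute
    groupdict[group].append(value)

print(sorted(groupdict))     # [2017, 2018]
print(len(groupdict[2018]))  # 2
```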
Sometimes you have a function that returns a value's group when it's called. For example, you could use the len function to group values by their length:
group = len(value)
If you wish to group your data by more than a single value, you can use a tuple as the group identifier. For example, to group strings by their first letter and their length:
group = (value[0], len(value))
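A short sketch of the tuple-key idea, using a made-up list of words:

```python
import collections

words = ['apple', 'angle', 'ant', 'bee', 'bat']  # made-up sample data

groupdict = collections.defaultdict(list)
for value in words:
    group = (value[0], len(value))  # (first letter, length)
    groupdict[group].append(value)

print(dict(groupdict))
# {('a', 5): ['apple', 'angle'], ('a', 3): ['ant'], ('b', 3): ['bee', 'bat']}
```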
Because dict keys must be hashable, you will run into problems if you try to group by something that can't be hashed. In such a case, you have to find a way to convert the unhashable value to a hashable representation.
sets: Convert sets to frozensets, which are hashable:
group = frozenset(group)
dicts: Dicts can be represented as sorted tuples of (key, value) pairs, provided the keys are sortable and the values are hashable:
group = tuple(sorted(group.items()))
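To illustrate the set case, here is a sketch that groups names by a set of tags. The (tags, name) pairs are invented for the example:

```python
import collections

# invented sample data: (tags, name) pairs, where tags is an unhashable set
data = [({'x', 'y'}, 'first'), ({'y', 'x'}, 'second'), ({'z'}, 'third')]

groupdict = collections.defaultdict(list)
for tags, name in data:
    group = frozenset(tags)  # hashable stand-in for the set
    groupdict[group].append(name)

print(groupdict[frozenset({'x', 'y'})])  # ['first', 'second']
print(groupdict[frozenset({'z'})])       # ['third']
```

Using the plain set as the key instead would raise a TypeError, because sets are unhashable.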
Sometimes you will want to modify the values you're grouping. For example, if you're grouping tuples like (1, 'a') and (1, 'b') by the first element, you might want to remove the first element from each tuple to get a result like {1: ['a', 'b']} rather than {1: [(1, 'a'), (1, 'b')]}.
In this section, we're customizing the value = ??? line of the recipe.
If you don't want to change the value in any way, simply delete the value = ??? line from your code.
If your values are lists like [1, 'a'] and you only want to keep the 'a':
value = value[1]
Or if they're dicts like {'firstname': 'foo', 'lastname': 'bar'} and you only want to keep the first name:
value = value['firstname']
If your values are lists like [1, 'a', 'foo'] and [1, 'b', 'bar'] and you want to discard the first element of each list to get a group like [['a', 'foo'], ['b', 'bar']], use the slicing syntax:
value = value[1:]
If your values are lists like ['foo', 'bar', 'baz'] or dicts like {'firstname': 'foo', 'middlename': 'bar', 'lastname': 'baz'} and you want to delete or keep only some of these elements, start by creating a set of the indices or keys you want to keep or delete. For example:
indices_to_keep = {0, 2}
keys_to_delete = {'firstname', 'middlename'}
Then choose the appropriate snippet from this list:
value = [val for i, val in enumerate(value) if i in indices_to_keep]
value = [val for i, val in enumerate(value) if i not in indices_to_delete]
value = {key: val for key, val in value.items() if key in keys_to_keep}
value = {key: val for key, val in value.items() if key not in keys_to_delete}
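Here is a quick sketch of two of these snippets applied to the example values from above. The names list_value, dict_value, kept and remaining are only used for this illustration:

```python
indices_to_keep = {0, 2}
keys_to_delete = {'firstname', 'middlename'}

list_value = ['foo', 'bar', 'baz']
dict_value = {'firstname': 'foo', 'middlename': 'bar', 'lastname': 'baz'}

# keep only the list elements at the chosen indices
kept = [val for i, val in enumerate(list_value) if i in indices_to_keep]
# drop the unwanted dict keys
remaining = {key: val for key, val in dict_value.items() if key not in keys_to_delete}

print(kept)       # ['foo', 'baz']
print(remaining)  # {'lastname': 'baz'}
```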
Once the grouping is complete, we have a defaultdict filled with lists. But the desired result isn't always a (default)dict.
In this section, we're customizing the result = groupdict line of the recipe.
To convert the defaultdict to a regular dict, simply call the dict constructor on it:
result = dict(groupdict)
(group, value) pairs
To get a result like [(group1, value1), (group1, value2), (group2, value3)] from the dict {group1: [value1, value2], group2: [value3]}, use a list comprehension:
result = [(group, value) for group, values in groupdict.items()
          for value in values]
To get a result like [[value1, value2], [value3]] from the dict {group1: [value1, value2], group2: [value3]}, use dict.values:
result = list(groupdict.values())
To get a result like [value1, value2, value3] from the dict {group1: [value1, value2], group2: [value3]}, flatten the dict with a list comprehension:
result = [value for values in groupdict.values() for value in values]
If your values are lists or other iterables like
groupdict = {group1: [[list1_value1, list1_value2], [list2_value1]]}
and you want a flattened result like
result = {group1: [list1_value1, list1_value2, list2_value1]}
you have two options:
Flatten the lists with a dict comprehension:
result = {group: [x for iterable in values for x in iterable]
          for group, values in groupdict.items()}
Avoid creating a list of iterables in the first place, by using list.extend instead of list.append. In other words, change
groupdict[group].append(value)
to
groupdict[group].extend(value)
And then just set result = groupdict.
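A minimal sketch of the list.extend variant, assuming the input is a list of (group, list of values) pairs (made-up data):

```python
import collections

# made-up sample data: (group, list of values) pairs
data = [('a', [1, 2]), ('b', [3]), ('a', [4])]

groupdict = collections.defaultdict(list)
for group, values in data:
    # extend adds the individual elements instead of nesting the list
    groupdict[group].extend(values)

print(dict(groupdict))  # {'a': [1, 2, 4], 'b': [3]}
```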
Before Python 3.7, dicts were unordered data structures: if you iterated over a dict, you never knew in which order its elements would be listed. Since Python 3.7, dicts preserve insertion order, so groups appear in the order in which they were first encountered. If that order is fine for you, you can use the recipes shown above. But if you need a different order, you have to sort the output accordingly.
I'll use the following dict to demonstrate how to sort your output in various ways:
groupdict = {'abc': [1], 'xy': [2, 5]}
Keep in mind that this is a bit of a meta-recipe that may need to be combined with other parts of this answer to get exactly the output you want. The general idea is to sort the dictionary keys before using them to extract the values from the dict:
groups = sorted(groupdict.keys())
# groups = ['abc', 'xy']
Keep in mind that sorted accepts a key function in case you want to customize the sort order. For example, if the dict keys are strings and you want to sort them by length:
groups = sorted(groupdict.keys(), key=len)
# groups = ['xy', 'abc']
Once you've sorted the keys, use them to extract the values from the dict in the correct order:
# groups = ['abc', 'xy']
result = [groupdict[group] for group in groups]
# result = [[1], [2, 5]]
Remember that this can be combined with other parts of this answer to get different kinds of output. For example, if you want to keep the group identifiers:
# groups = ['abc', 'xy']
result = [(group, groupdict[group]) for group in groups]
# result = [('abc', [1]), ('xy', [2, 5])]
For your convenience, here are some commonly used sort orders:
Sort by number of values per group:
groups = sorted(groupdict.keys(), key=lambda group: len(groupdict[group]))
result = [groupdict[group] for group in groups]
# result = [[1], [2, 5]]
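If you would rather have the largest group first, one option is to pass reverse=True to sorted. The sketch below uses the demonstration dict from above and keeps the group identifiers in the output:

```python
groupdict = {'abc': [1], 'xy': [2, 5]}

# sort groups by number of values, largest group first
groups = sorted(groupdict, key=lambda group: len(groupdict[group]), reverse=True)
result = [(group, groupdict[group]) for group in groups]

print(result)  # [('xy', [2, 5]), ('abc', [1])]
```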
To count the number of elements associated with each group, use the len function:
result = {group: len(values) for group, values in groupdict.items()}
If you want to count the number of distinct elements, use set to eliminate duplicates:
result = {group: len(set(values)) for group, values in groupdict.items()}
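If the counts are all you need and you never use the grouped values themselves, a possible shortcut is collections.Counter, which counts group identifiers directly without building the lists. A sketch using the data from the question:

```python
import collections

data = [(1, 'a'), (2, 'x'), (1, 'b')]

# count how often each group identifier occurs
counts = collections.Counter(value[0] for value in data)

print(dict(counts))  # {1: 2, 2: 1}
```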
To demonstrate how to piece together a working solution from this recipe, let's try to turn an input of
data = [["A",0], ["B",1], ["C",0], ["D",2], ["E",2]]
into
result = [["A", "C"], ["B"], ["D", "E"]]
In other words, we're grouping lists by their 2nd element.
The first two lines of the recipe are always the same, so let's start by copying those:
import collections
groupdict = collections.defaultdict(list)
Now we have to find out how to loop over the input. Since our input is a simple list of values, a normal for loop will suffice:
for value in data:
Next we have to extract the group identifier from the value. We're grouping by the 2nd list element, so we use indexing:
group = value[1]
The next step is to transform the value. Since we only want to keep the first element of each list, we once again use list indexing:
value = value[0]
Finally, we have to figure out how to turn the dict we generated into a list. What we want is a list of values, without the groups. We consult the Output section of the recipe to find the appropriate dict flattening snippet:
result = list(groupdict.values())
Et voilà:
data = [["A",0], ["B",1], ["C",0], ["D",2], ["E",2]]
import collections
groupdict = collections.defaultdict(list)
for value in data:
    group = value[1]
    value = value[0]
    groupdict[group].append(value)
result = list(groupdict.values())
# result: [["A", "C"], ["B"], ["D", "E"]]
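If you find yourself assembling the recipe often, it can be wrapped in a small function. The helper below is hypothetical: the name group_by and its parameters get_group and get_value are my own choices for this sketch, not part of any library.

```python
import collections

def group_by(data, get_group, get_value=lambda value: value):
    """Hypothetical helper: group the items of data by get_group(item),
    storing get_value(item) in each group."""
    groupdict = collections.defaultdict(list)
    for value in data:
        groupdict[get_group(value)].append(get_value(value))
    return dict(groupdict)

data = [["A", 0], ["B", 1], ["C", 0], ["D", 2], ["E", 2]]
grouped = group_by(data, get_group=lambda v: v[1], get_value=lambda v: v[0])

print(grouped)                 # {0: ['A', 'C'], 1: ['B'], 2: ['D', 'E']}
print(list(grouped.values()))  # [['A', 'C'], ['B'], ['D', 'E']]
```

The input and output steps stay outside the function, so they can still be customized as described in the sections above.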
Upvotes: 7
Reputation: 18950
There is a general-purpose recipe in itertools: groupby().
A schema of this recipe can be given in this form:
[(k, aggregate(g)) for k, g in groupby(sorted(data, key=extractKey), extractKey)]
The two relevant parts to change in the recipe are:
define the grouping key (extractKey): in this case, getting the first item of the tuple:
lambda x: x[0]
aggregate grouped results, if needed (aggregate): g contains all the matching tuples for each key k (e.g. (1, 'a'), (1, 'b') for key 1, and (2, 'x') for key 2). We want to take only the second item of each tuple and concatenate all of those into one string:
''.join(x[1] for x in g)
Example:
from itertools import groupby
extractKey = lambda x: x[0]
aggregate = lambda g: ''.join(x[1] for x in g)
[(k, aggregate(g)) for k, g in groupby(sorted(data, key=extractKey), extractKey)]
# [(1, 'ab'), (2, 'x')]
Sometimes, extractKey, aggregate, or both can be inlined into a one-liner (we omit the sort key too, as it's redundant for this example):
[(k, ''.join(x[1] for x in g)) for k, g in groupby(sorted(data), lambda x: x[0])]
# [(1, 'ab'), (2, 'x')]
Comparing this recipe with the recipe using defaultdict, there are pros and cons in both cases.
groupby() tends to be slower (about twice as slow in my tests) than the defaultdict recipe.
On the other hand, groupby() has advantages in the memory-constrained case where the values are being produced on the fly and already arrive sorted by key: you can process the groups in a streaming fashion, without storing them, whereas defaultdict requires the memory to store all of them.
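One thing to watch out for: groupby() only merges consecutive items with the same key, which is why the recipe sorts the data first. The following sketch, using the data from the question, shows what happens when the sort is skipped:

```python
from itertools import groupby

data = [(1, 'a'), (2, 'x'), (1, 'b')]
first = lambda x: x[0]

# without sorting, key 1 shows up as two separate groups
unsorted_keys = [k for k, g in groupby(data, first)]
# with sorting, each key appears exactly once
sorted_keys = [k for k, g in groupby(sorted(data, key=first), first)]

print(unsorted_keys)  # [1, 2, 1]
print(sorted_keys)    # [1, 2]
```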
Upvotes: 1