Reputation: 32447
I haven't been able to find an understandable explanation of how to actually use Python's itertools.groupby() function. What I'm trying to do is group a list of lxml elements by some criteria.
I've reviewed the documentation, but I've had trouble applying it beyond a simple list of numbers.
So, how do I use itertools.groupby()? Is there another technique I should be using? Pointers to good "prerequisite" reading would also be appreciated.
Upvotes: 711
Views: 493289
Reputation: 89
Here is an example of how to use groupby directly on a list of dictionaries.
from itertools import groupby
items = [{'app_id': '55222702242335', 'mail_id': '4890770'},
{'app_id': '44322702242745', 'mail_id': '4890770'},
{'app_id': '80513948813781', 'mail_id': '5083772'},
{'app_id': '70514248813211', 'mail_id': '5083772'}]
items.sort(key=lambda x: x['mail_id'])
grouped_items = groupby(items, lambda x:x["mail_id"])
result = {}
for key, item in grouped_items:
    result[key] = list(item)
print(result)
Sample output:
{'4890770': [{'app_id': '55222702242335', 'mail_id': '4890770'}, {'app_id': '44322702242745', 'mail_id': '4890770'}], '5083772': [{'app_id': '80513948813781', 'mail_id': '5083772'}, {'app_id': '70514248813211', 'mail_id': '5083772'}]}
Or, more Pythonically, with a dict comprehension:
from itertools import groupby
items = [{'app_id': '55222702242335', 'mail_id': '4890770'},
{'app_id': '44322702242745', 'mail_id': '4890770'},
{'app_id': '80513948813781', 'mail_id': '5083772'},
{'app_id': '70514248813211', 'mail_id': '5083772'}]
items.sort(key=lambda x: x['mail_id'])
result = {key: list(group) for key, group in groupby(items, key=lambda x: x['mail_id'])}
print(result)
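Because the sort key and the grouping key have to agree, it can help to define the key once and reuse it. The following is a minimal sketch under that assumption: the data is a shortened, reordered variant of the list above, operator.itemgetter stands in for the lambda, and the name by_mail_id is purely illustrative.

```python
from itertools import groupby
from operator import itemgetter

items = [{'app_id': '55222702242335', 'mail_id': '4890770'},
         {'app_id': '80513948813781', 'mail_id': '5083772'},
         {'app_id': '44322702242745', 'mail_id': '4890770'}]

# Define the key once so sorting and grouping cannot drift apart.
by_mail_id = itemgetter('mail_id')

# sorted() returns a new list, so the original order of items is untouched.
result = {key: [item['app_id'] for item in group]
          for key, group in groupby(sorted(items, key=by_mail_id), key=by_mail_id)}
print(result)
```

Since Python's sort is stable, the entries inside each group keep their original relative order.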
Upvotes: 0
Reputation: 32447
IMPORTANT NOTE: You may have to sort your data first.
The part I didn't get is that in the example construction
groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))  # Store group iterator as a list
    uniquekeys.append(k)
k is the current grouping key, and g is an iterator that you can use to iterate over the group defined by that grouping key. In other words, the groupby iterator itself returns iterators.
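One practical consequence of each group being an iterator is that it can only be read once. A small sketch, with made-up data and illustrative variable names, that reads every group twice:

```python
from itertools import groupby

data = [1, 1, 2, 2, 2, 3]

seen = []
for k, g in groupby(data):
    first_pass = list(g)   # consumes the group iterator
    second_pass = list(g)  # nothing is left to read
    seen.append((k, first_pass, second_pass))

for row in seen:
    print(row)
```

If a group is needed more than once, storing list(g) as in the docs example is the simplest way to keep it.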
Here's an example of that, using clearer variable names:
from itertools import groupby
things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print("")
This will give you the output:
A bear is a animal.
A duck is a animal.

A cactus is a plant.

A speed boat is a vehicle.
A school bus is a vehicle.
In this example, things is a list of tuples where the first item in each tuple is the group the second item belongs to.
The groupby() function takes two arguments: (1) the data to group and (2) the function to group it with.
Here, lambda x: x[0] tells groupby() to use the first item in each tuple as the grouping key.
In the above for statement, groupby yields three (key, group iterator) pairs, one for each unique key. You can use the returned iterator to iterate over each individual item in that group.
Here's a slightly different example with the same data, using a list comprehension:
for key, group in groupby(things, lambda x: x[0]):
    listOfThings = " and ".join([thing[1] for thing in group])
    print(key + "s: " + listOfThings + ".")
This will give you the output:
animals: bear and duck.
plants: cactus.
vehicles: speed boat and school bus.
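To illustrate the note about sorting, here is a sketch that uses the same kind of tuples but with the keys deliberately interleaved. The helper name first and the reordered data are assumptions made for the example.

```python
from itertools import groupby

# Same kind of data as above, but with the keys deliberately interleaved.
things = [("vehicle", "speed boat"), ("animal", "bear"),
          ("vehicle", "school bus"), ("plant", "cactus"), ("animal", "duck")]

def first(pair):
    """Grouping key: the first item of each tuple."""
    return pair[0]

unsorted_keys = [key for key, _ in groupby(things, first)]
sorted_keys = [key for key, _ in groupby(sorted(things, key=first), first)]

print(unsorted_keys)  # one entry per run of equal keys
print(sorted_keys)    # one entry per distinct key
```

Sorting with the same key function that groupby uses is what guarantees one group per key.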
Upvotes: 891
Reputation: 810
The key thing to recognize with itertools.groupby is that items are only grouped together as long as they're sequential in the iterable. This is why sorting works: you're rearranging the collection so that all of the items with the same callback(item) value now appear in the sorted collection sequentially.
That being said, you don't need to sort the list; you just need a collection of key-value pairs where the value can grow with each group iterable yielded by groupby, i.e. a dict of lists.
>>> import itertools
>>> things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
>>> coll = {}
>>> for k, g in itertools.groupby(things, lambda x: x[0]):
...     coll.setdefault(k, []).extend(i for _, i in g)
...
>>> coll
{'vehicle': ['bear', 'speed boat', 'school bus'], 'animal': ['duck', 'cactus']}
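As a cross-check, the dict-of-lists approach should produce the same groups as sorting first. The sketch below assumes the same things data, swaps setdefault for collections.defaultdict, and compares both results; the names merged, via_sort and key0 are illustrative.

```python
from collections import defaultdict
from itertools import groupby

things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"),
          ("vehicle", "speed boat"), ("vehicle", "school bus")]

def key0(pair):
    """Grouping key: the first item of each tuple."""
    return pair[0]

# Approach 1: merge runs of equal keys into a dict of lists, no sorting.
merged = defaultdict(list)
for key, group in groupby(things, key0):
    merged[key].extend(name for _, name in group)

# Approach 2: sort first so that each key forms exactly one run.
via_sort = {key: [name for _, name in group]
            for key, group in groupby(sorted(things, key=key0), key0)}

print(dict(merged) == via_sort)
```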
Upvotes: 2
Reputation: 618
from random import randint
from itertools import groupby
l = [randint(1, 3) for _ in range(20)]
d = {}
for k, g in groupby(l, lambda x: x):
    if not d.get(k, None):
        d[k] = list(g)
    else:
        d[k] = d[k] + list(g)
The code above shows how groupby can be used to group a list based on the lambda function/key supplied. The only problem is that groupby on its own does not merge non-adjacent groups; this is easily resolved by collecting the groups in a dictionary.
Example:
l = [2, 1, 2, 3, 1, 3, 2, 1, 3, 3, 1, 3, 2, 3, 1, 2, 1, 3, 2, 3]
After applying groupby on its own, the result will be:
for k, g in groupby(l, lambda x: x):
    print(k, list(g))
2 [2]
1 [1]
2 [2]
3 [3]
1 [1]
3 [3]
2 [2]
1 [1]
3 [3, 3]
1 [1]
3 [3]
2 [2]
3 [3]
1 [1]
2 [2]
1 [1]
3 [3]
2 [2]
3 [3]
Once a dictionary is used as shown above, the following result is obtained, which can easily be iterated over:
{2: [2, 2, 2, 2, 2, 2], 1: [1, 1, 1, 1, 1, 1], 3: [3, 3, 3, 3, 3, 3, 3, 3]}
Upvotes: 6
Reputation: 6128
Another example:
import itertools

for key, igroup in itertools.groupby(range(12), lambda x: x // 5):
    print(key, list(igroup))
results in
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11]
Note that igroup is an iterator (a sub-iterator as the documentation calls it).
This is useful for chunking a generator:
def chunker(items, chunk_size):
    '''Group items in chunks of chunk_size'''
    for _key, group in itertools.groupby(enumerate(items), lambda x: x[0] // chunk_size):
        yield (g[1] for g in group)

with open('file.txt') as fobj:
    for chunk in chunker(fobj, 100):  # 100 lines per chunk (arbitrary example value)
        process(chunk)  # process() stands for your own handler
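For a version that can be run without a file, here is a sketch of the same chunking idea applied to an in-memory list. It deviates from the code above in one respect, which is an assumption of this example: each chunk is returned as a list rather than a generator, so the chunks stay valid after groupby has moved on.

```python
import itertools

def chunker(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    indexed = enumerate(items)
    for _key, group in itertools.groupby(indexed, lambda pair: pair[0] // chunk_size):
        # Build the list now: the group cannot be read once groupby advances.
        yield [item for _index, item in group]

lines = ["line %d" % n for n in range(7)]  # stand-in for a file object
chunks = list(chunker(lines, 3))
print(chunks)
```

The generator-of-generators form uses less memory per chunk, but each chunk then has to be consumed before the next one is requested.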
Another example of groupby, when the keys are not sorted. In the following example, items in xx are grouped by values in yy. In this case, one set of zeros is output first, followed by a set of ones, followed again by a set of zeros.
xx = range(10)
yy = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
for group in itertools.groupby(iter(xx), lambda x: yy[x]):
    print(group[0], list(group[1]))
Produces:
0 [0, 1, 2]
1 [3, 4, 5]
0 [6, 7, 8, 9]
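The key function above looks up yy by index, which only works when the items in xx happen to be valid indices. A possible alternative, sketched here with the same data, pairs each item with its label via zip and groups on the label; the names runs, label and pairs are illustrative.

```python
import itertools
from operator import itemgetter

xx = range(10)
yy = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

# Pair each item with its label, group on the label, then drop the label again.
runs = [(label, [x for x, _ in pairs])
        for label, pairs in itertools.groupby(zip(xx, yy), key=itemgetter(1))]
print(runs)
```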
Upvotes: 35
Reputation: 34803
Sadly I don’t think it’s advisable to use itertools.groupby(). It’s just too hard to use safely, and it’s only a handful of lines to write something that works as expected.
from collections import defaultdict

def my_group_by(iterable, keyfunc):
    """Because itertools.groupby is tricky to use

    The stdlib method requires sorting in advance, and returns iterators not
    lists, and those iterators get consumed as you try to use them, throwing
    everything off if you try to look at something more than once.
    """
    ret = defaultdict(list)
    for k in iterable:
        ret[keyfunc(k)].append(k)
    return dict(ret)
Use it like this:
def first_letter(x):
    return x[0]

my_group_by('four score and seven years ago'.split(), first_letter)
to get
{'f': ['four'], 's': ['score', 'seven'], 'a': ['and', 'ago'], 'y': ['years']}
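A further difference worth noting is that a dict-based grouper only needs hashable keys, while the sort-then-groupby route needs keys that can be ordered. The sketch below repeats the helper so it runs on its own and groups mixed values by their type, which is a made-up scenario chosen because type objects cannot be compared with <.

```python
from collections import defaultdict

def my_group_by(iterable, keyfunc):
    """Group items into a dict of lists; keys only need to be hashable."""
    ret = defaultdict(list)
    for item in iterable:
        ret[keyfunc(item)].append(item)
    return dict(ret)

values = [1, "a", 2.5, "b", 3]

# Grouping by type works because type objects are hashable...
by_type = my_group_by(values, type)
print(by_type)

# ...whereas sorting by the same key fails, since types cannot be ordered.
try:
    sorted(values, key=type)
    sortable = True
except TypeError:
    sortable = False
print(sortable)
```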
Upvotes: 15
Reputation: 44455
itertools.groupby is a tool for grouping items.
From the docs, we glean further what it might do:
# [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
groupby objects yield key-group pairs where the group is an iterator.
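Features
A. Group consecutive occurrences
B. Group all occurrences (given a sorted iterable)
C. Group by a key function *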
Comparisons
# Define a printer for comparing outputs
>>> import itertools as it
>>> def print_groupby(iterable, keyfunc=None):
...     for k, g in it.groupby(iterable, keyfunc):
...         print("key: '{}'--> group: {}".format(k, list(g)))
# Feature A: group consecutive occurrences
>>> print_groupby("BCAACACAADBBB")
key: 'B'--> group: ['B']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'D'--> group: ['D']
key: 'B'--> group: ['B', 'B', 'B']
# Feature B: group all occurrences
>>> print_groupby(sorted("BCAACACAADBBB"))
key: 'A'--> group: ['A', 'A', 'A', 'A', 'A']
key: 'B'--> group: ['B', 'B', 'B', 'B']
key: 'C'--> group: ['C', 'C', 'C']
key: 'D'--> group: ['D']
# Feature C: group by a key function
>>> # islower = lambda s: s.islower() # equivalent
>>> def islower(s):
...     """Return True if a string is lowercase, else False."""
...     return s.islower()
>>> print_groupby(sorted("bCAaCacAADBbB"), keyfunc=islower)
key: 'False'--> group: ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']
key: 'True'--> group: ['a', 'a', 'b', 'b', 'c']
Uses
Note: Several of the latter examples derive from Víctor Terrón's PyCon talk (in Spanish), "Kung Fu at Dawn with Itertools". See also the groupby source code written in C.
* A function where all items are passed through and compared, influencing the result. Other objects with key functions include sorted(), max() and min().
Response
# OP: Yes, you can use `groupby`, e.g.
[do_something(list(g)) for _, g in groupby(lxml_elements, criteria_func)]
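To make the response above concrete, here is a hedged sketch that groups the children of an XML element by tag. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption; lxml elements should iterate in a similar way), and the document, tag names and criteria_func are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical document; the tag names are made up for illustration.
root = ET.fromstring(
    "<root><a>1</a><a>2</a><b>3</b><a>4</a><b>5</b></root>"
)

def criteria_func(element):
    # Assumed grouping criterion: the element's tag name.
    return element.tag

# Iterating an element yields its children; sort so equal tags are adjacent.
children = sorted(root, key=criteria_func)
grouped = {tag: [el.text for el in group]
           for tag, group in groupby(children, key=criteria_func)}
print(grouped)
```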
Upvotes: 168
Reputation: 1904
This basic implementation helped me understand this function. Hope it helps others as well:
from itertools import groupby

arr = [(1, "A"), (1, "B"), (1, "C"), (2, "D"), (2, "E"), (3, "F")]
for k, g in groupby(arr, lambda x: x[0]):
    print("--", k, "--")
    for tup in g:
        print(tup[1])  # tup[0] == k
-- 1 --
A
B
C
-- 2 --
D
E
-- 3 --
F
Upvotes: 9
Reputation: 17795
The example on the Python docs is quite straightforward:
groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))  # Store group iterator as a list
    uniquekeys.append(k)
So in your case, data is a list of nodes, keyfunc is where the logic of your criteria function goes, and then groupby() groups the data.
You must be careful to sort the data by the criteria before you call groupby, or items that share a key but are not adjacent will end up in separate groups. groupby actually just iterates through the data and starts a new group whenever the key changes.
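The "new group whenever the key changes" behaviour can be modelled in a few lines of plain Python. This is a simplified, eager sketch for building intuition, not the actual implementation: the real groupby is lazy and yields iterators, and simple_groupby is a hypothetical name.

```python
from itertools import groupby

def simple_groupby(iterable, keyfunc=lambda x: x):
    """Eager, simplified model of groupby: yields (key, list) pairs."""
    sentinel = object()          # marks "no group started yet"
    current_key = sentinel
    current_group = []
    for item in iterable:
        key = keyfunc(item)
        if current_key is not sentinel and key != current_key:
            # The key changed, so the current group is finished.
            yield current_key, current_group
            current_group = []
        current_key = key
        current_group.append(item)
    if current_key is not sentinel:
        yield current_key, current_group  # emit the final group

data = "AAABBCA"
print(list(simple_groupby(data)))
```

Note that the trailing A forms its own group, since nothing in this loop ever looks back at earlier keys.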
Upvotes: 74
Reputation: 121
Sorting and groupby
from itertools import groupby
val = [{'name': 'satyajit', 'address': 'btm', 'pin': 560076},
{'name': 'Mukul', 'address': 'Silk board', 'pin': 560078},
{'name': 'Preetam', 'address': 'btm', 'pin': 560076}]
for pin, list_data in groupby(sorted(val, key=lambda k: k['pin']), lambda x: x['pin']):
    print(pin)
    for rec in list_data:
        print(rec)
Output:
560076
{'name': 'satyajit', 'address': 'btm', 'pin': 560076}
{'name': 'Preetam', 'address': 'btm', 'pin': 560076}
560078
{'name': 'Mukul', 'address': 'Silk board', 'pin': 560078}
Upvotes: 10
Reputation: 4992
A neat trick with groupby is run-length encoding in one line:
[(c,len(list(cgen))) for c,cgen in groupby(some_string)]
will give you a list of 2-tuples where the first element is the char and the 2nd is the number of repetitions.
Edit: Note that this is what separates itertools.groupby from the SQL GROUP BY semantics: itertools doesn't (and in general can't) sort the iterator in advance, so groups with the same "key" aren't merged.
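The fact that equal keys are not merged is exactly what makes the encoding reversible. A small round-trip sketch, with hypothetical helper names rle_encode and rle_decode and a made-up input string:

```python
from itertools import groupby

def rle_encode(text):
    """Return a list of (character, run length) pairs."""
    return [(char, len(list(run))) for char, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run length) pairs."""
    return "".join(char * count for char, count in pairs)

encoded = rle_encode("aaabccdddd")
print(encoded)
print(rle_decode(encoded))
```

Had the input been sorted first, the counts would still be right but the original order of the runs would be lost.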
Upvotes: 52
Reputation: 309
One useful example that I came across may be helpful:
from itertools import groupby
#user input
myinput = input()
#creating empty list to store output
myoutput = []
for k, g in groupby(myinput):
    myoutput.append((len(list(g)), int(k)))
print(*myoutput)
Sample input: 14445221
Sample output: (1, 1) (3, 4) (1, 5) (2, 2) (1, 1)
Upvotes: 5
Reputation: 3312
@CaptSolo, I tried your example, but it didn't work.
from itertools import groupby
[(c,len(list(cs))) for c,cs in groupby('Pedro Manoel')]
Output:
[('P', 1), ('e', 1), ('d', 1), ('r', 1), ('o', 1), (' ', 1), ('M', 1), ('a', 1), ('n', 1), ('o', 1), ('e', 1), ('l', 1)]
As you can see, there are two o's and two e's, but they ended up in separate groups because they are not adjacent. That's when I realized that, to count every occurrence of a character, you need to sort the list passed to groupby. For that purpose, the usage would be:
name = list('Pedro Manoel')
name.sort()
[(c,len(list(cs))) for c,cs in groupby(name)]
Output:
[(' ', 1), ('M', 1), ('P', 1), ('a', 1), ('d', 1), ('e', 2), ('l', 1), ('n', 1), ('o', 2), ('r', 1)]
Just remember: if the list is not sorted, groupby only groups consecutive equal items, so repeated keys show up as separate groups.
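If the goal is only to count how often each character occurs, collections.Counter is a standard-library alternative that needs no sorting. The sketch below contrasts it with the runs that groupby reports on the unsorted name; the variable names runs and totals are illustrative.

```python
from collections import Counter
from itertools import groupby

name = 'Pedro Manoel'

# Runs of consecutive equal characters (what groupby computes on unsorted input).
runs = [(c, len(list(cs))) for c, cs in groupby(name)]

# Total occurrences of each character, no sorting required.
totals = Counter(name)

print(runs[:3])
print(totals['o'], totals['e'])
```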
Upvotes: 10
Reputation: 394775
How do I use Python's itertools.groupby()?
You can use groupby to group things to iterate over. You give groupby an iterable and an optional key function (any callable) that is applied to each item as it comes out of the iterable. It returns an iterator that yields two-tuples: the result of the key callable, and a sub-iterator over the items that share that key. From the help:
groupby(iterable[, keyfunc]) -> create an iterator which returns
(key, sub-iterator) grouped by each value of key(value).
Here's an example of groupby using a coroutine to group by a count. It uses a key callable (in this case, coroutine.send) to just spit out the count for however many iterations, and a grouped sub-iterator of elements:
import itertools
def grouper(iterable, n):
    def coroutine(n):
        yield  # queue up coroutine
        for i in itertools.count():
            for j in range(n):
                yield i
    groups = coroutine(n)
    next(groups)  # queue up coroutine
    for c, objs in itertools.groupby(iterable, groups.send):
        yield c, list(objs)
    # or instead of materializing a list of objs, just:
    # return itertools.groupby(iterable, groups.send)

list(grouper(range(10), 3))
returns
[(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9])]
Upvotes: 8
Reputation: 5433
WARNING:
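The syntax list(groupby(...)) won't work the way that you intend. Each group iterator shares the underlying iterator with the groupby object, so advancing groupby (which list() does for you) invalidates the groups that came before. As a result,
for x in list(groupby(range(10))):
    print(list(x[1]))
will produce (the last line may differ between Python versions):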
[]
[]
[]
[]
[]
[]
[]
[]
[]
[9]
Instead of list(groupby(...)), try [(k, list(g)) for k, g in groupby(...)], or if you use that syntax often,
def groupbylist(*args, **kwargs):
    return [(k, list(g)) for k, g in groupby(*args, **kwargs)]
and get access to the groupby functionality while avoiding those pesky (for small data) iterators altogether.
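The invalidation can be observed directly by advancing the groupby object by hand. A minimal sketch, using a made-up three-character input and illustrative variable names:

```python
from itertools import groupby

gb = groupby("AAB")
key1, group1 = next(gb)   # first group, not yet read
key2, group2 = next(gb)   # advancing groupby skips past the first group

leftover = list(group1)   # the first group can no longer be read
current = list(group2)    # the current group is still readable
print(key1, leftover)
print(key2, current)

# Materializing each group before advancing avoids the problem.
safe = [(k, list(g)) for k, g in groupby("AAB")]
print(safe)
```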
Upvotes: 26
Reputation: 26335
I would like to give another example where groupby without sorting does not give the expected result. Adapted from the example by James Sulak:
from itertools import groupby
things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print(" ")
Output:
A bear is a vehicle.

A duck is a animal.
A cactus is a animal.

A speed boat is a vehicle.
A school bus is a vehicle.

There are two groups with the key vehicle, whereas one might expect only one.
Upvotes: 14