Reputation: 537
Not sure how to title this question. I've run into a few situations where I have a list of data, maybe annotated with some property, and I want to collect them into groups.
For example, maybe I have a file like this:
some event
reading: 25.4
reading: 23.4
reading: 25.1
different event
reading: 22.3
reading: 21.1
reading: 26.0
reading: 25.2
another event
reading: 25.5
reading: 25.1
and I want to group each set of readings, splitting them on a condition (in this case, an event happening) so that I end up with a structure like
[['some event',
'reading: 25.4',
'reading: 23.4',
'reading: 25.1'],
['different event',
'reading: 22.3',
'reading: 21.1',
'reading: 26.0',
'reading: 25.2'],
['another event',
'reading: 25.5',
'reading: 25.1']]
In its generic form, it is: look for a condition, collect the data until that condition is true again, repeat.
Right now, I'd do something like
events = []
current_event = []
for line in lines:
    if is_event(line):
        if current_event:
            events.append(current_event)
        current_event = [line]
    else:
        current_event.append(line)
else:
    if current_event:
        events.append(current_event)

def is_event(line):
    return 'event' in line
which produces what I want, but it's ugly and hard to understand. I'm fairly certain there has to be a better way.
My guess is that it involves some itertools wizardry, but I'm new to itertools and can't quite wrap my head around all of it.
Thanks!
I've actually gone with Steve Jessop's answer with a Grouper class. Here's what I'm doing:
class Grouper(object):
    def __init__(self, condition_function):
        self.count = 0
        self.condition_function = condition_function

    def __call__(self, line):
        if self.condition_function(line):
            self.count += 1
        return self.count
and then using it like
event_grouper = Grouper(is_event)
result_as_iterators = (x[1] for x in itertools.groupby(lines, event_grouper))
and then to turn it into a list of one-entry dictionaries I do
event_dictionary = [{event: readings} for event, *readings in result_as_iterators]
which gives
[
{'some event': ['reading: 25.4', 'reading: 23.4', 'reading: 25.1']},
{'different event': ['reading: 22.3','reading: 21.1','reading: 26.0','reading: 25.2']},
{'another event': ['reading: 25.5', 'reading: 25.1']}
]
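For reference, the pieces above can be put together into one runnable sketch. The `lines` list is an assumption standing in for the file contents, and the final merge into a single dictionary (`merged`) is an extra step that is not in the original; it only makes sense if event names are unique.

```python
import itertools

# Sample data standing in for the file contents (assumption).
lines = [
    'some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1',
    'different event', 'reading: 22.3', 'reading: 21.1', 'reading: 26.0', 'reading: 25.2',
    'another event', 'reading: 25.5', 'reading: 25.1',
]

def is_event(line):
    return 'event' in line

class Grouper:
    """Key function for groupby: bumps a counter at every line that starts a new group."""
    def __init__(self, condition_function):
        self.count = 0
        self.condition_function = condition_function

    def __call__(self, line):
        if self.condition_function(line):
            self.count += 1
        return self.count

groups = (group for _, group in itertools.groupby(lines, Grouper(is_event)))

# Each group is unpacked completely before the next one is requested,
# so the lazy group iterators are consumed while they are still valid.
event_dictionary = [{event: readings} for event, *readings in groups]

# Extra step (assumes unique event names): merge into one dictionary.
merged = {event: readings
          for entry in event_dictionary
          for event, readings in entry.items()}
```

If two events can share a name, the list of one-entry dictionaries is the safer shape, since the merged dictionary would silently keep only the last one.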
Upvotes: 6
Views: 1109
Reputation: 279445
You can use the fact that functions in Python have state. This grouper function serves the same purpose as DSM's accumulate(fn(line) for line in s1):
def grouper(line):
    if is_event(line):
        grouper.count += 1
    return grouper.count
grouper.count = 0
result_as_iterators = (x[1] for x in itertools.groupby(lines, grouper))
Then if you need it:
result_as_lists = [list(x) for x in result_as_iterators]
To allow for concurrent use you need a new grouper function object each time you use it (so that it has its own count). You might find it simpler to make it a class:
class Grouper(object):
    def __init__(self):
        self.count = 0

    def __call__(self, line):
        if is_event(line):
            self.count += 1
        return self.count

# groupby yields (key, group) pairs; keep only the group iterators, as before
results_as_iterators = (x[1] for x in itertools.groupby(lines, Grouper()))
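A closure is another way to get a fresh counter for every use. This is a minimal sketch rather than part of the answer: `make_grouper` is a hypothetical name, the sample `lines` are made up, and `nonlocal` needs Python 3.

```python
import itertools

def make_grouper(condition_function):
    """Return a new key function with its own private counter (hypothetical helper)."""
    count = 0

    def key(line):
        nonlocal count
        if condition_function(line):
            count += 1
        return count

    return key

def is_event(line):
    return 'event' in line

# Made-up sample data.
lines = ['some event', 'reading: 25.4', 'different event', 'reading: 22.3', 'reading: 21.1']

# Each call to make_grouper starts counting from zero again,
# so the two passes do not interfere with each other.
first = [list(g) for _, g in itertools.groupby(lines, make_grouper(is_event))]
second = [list(g) for _, g in itertools.groupby(lines, make_grouper(is_event))]
```

With the function-attribute version, a second pass would keep counting from where the first pass stopped unless `grouper.count` is reset by hand.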
Upvotes: 2
Reputation: 366133
With itertools.groupby, you can easily group things based on a key, like 'event' in line. So, as a first step:
>>> for k, g in itertools.groupby(lines, lambda line: 'event' in line):
...     print(k, list(g))
Of course this doesn't put the events together with their values. I suspect you really don't want the events together with their values, but would actually prefer to have a dict of event: [values] or a list of (event, [values]). In which case you're nearly done. For example, to get that dict, just use the grouper recipe (or zip(*[iter(groups)]*2)) to group into pairs, then use a dict comprehension to map each k, v in those pairs to next(k): list(v).
On the other hand, if you really do want them together, it's the same steps, but with a list of [next(k)] + list(v) at the end.
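One way to turn that description into code is sketched below. It makes several assumptions: the sample `lines` stand in for the file, the input starts with an event, every event has at least one reading, and no two event lines are adjacent. It also turns each group into a list before pairing, because a groupby group stops being usable once the next group is requested; as a result `next(k)` becomes `event_run[0]`.

```python
from itertools import groupby

# Sample data standing in for the file contents (assumption).
lines = [
    'some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1',
    'different event', 'reading: 22.3', 'reading: 21.1', 'reading: 26.0', 'reading: 25.2',
    'another event', 'reading: 25.5', 'reading: 25.1',
]

def is_event(line):
    return 'event' in line

# Materialize every run immediately; pairing the lazy groups directly
# would advance groupby past a group before it has been read.
runs = [list(group) for _, group in groupby(lines, key=is_event)]

# Pair each run of event lines with the run of readings that follows it.
pairs = list(zip(*[iter(runs)] * 2))

# dict of event: [values]
events_to_readings = {event_run[0]: readings for event_run, readings in pairs}

# list of [event, value, value, ...]
together = [event_run + readings for event_run, readings in pairs]
```

If any of the assumptions fail (for example two event lines in a row), the runs no longer alternate cleanly and the pairing goes out of step, which is part of why the plain generator further down is easier to trust.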
However, if you don't actually understand groupby well enough to turn that description into code, you should probably write something you do understand. And that's not too hard:
def groupify(lines):
    event = []
    for line in lines:
        if 'event' in line:
            if event: yield event
            event = [line]
        else:
            event.append(line)
    if event: yield event
Yes, it's 7 lines (condensable to 4 with some tricks) instead of 3 (condensable to 1 by nesting comprehensions in an ugly way), but 7 lines you understand and can debug are more useful than 3 lines of magic.
When you iterate the generator created by this function, it gives you lists of lines, like this:
>>> for event in groupify(lines):
...     print(event)
This will print:
['some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1']
['different event', 'reading: 22.3', 'reading: 21.1', 'reading: 26.0', 'reading: 25.2']
['another event', 'reading: 25.5', 'reading: 25.1']
If you want a list instead of a generator (so you can index it, or iterate over it twice), you can do the same thing you do to turn any other iterable into a list:
events = list(groupify(lines))
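The same generator can be made generic by passing the condition in as a parameter, which matches the "look for a condition, collect until it is true again" form in the question. This is only a sketch: `split_before` and `starts_group` are hypothetical names, and unlike `groupify` it puts any items that come before the first match into a group of their own.

```python
def split_before(items, starts_group):
    """Yield lists of items, opening a new list at each item for which starts_group(item) is true."""
    group = []
    for item in items:
        if starts_group(item) and group:
            # A new group begins here, so hand over the finished one first.
            yield group
            group = []
        group.append(item)
    if group:
        # Do not lose the final group once the input runs out.
        yield group

# Made-up sample data.
lines = ['some event', 'reading: 25.4', 'reading: 23.4', 'different event', 'reading: 22.3']
events = list(split_before(lines, lambda line: 'event' in line))
numbers = list(split_before([1, 2, 10, 3, 11, 4], lambda n: n >= 10))
```

Because it is a generator, it also works on a file object directly, without reading every line into memory first.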
Upvotes: 5
Reputation: 22922
You can make your code more concise using list comprehensions:
# Load the file
with open("test.txt") as f:
    lines = [l.rstrip() for l in f]
# Record the line indices where events start/stop
events = [i for i in range(len(lines)) if "event" in lines[i]]
events.append(len(lines))  # required to get the last event
# Group the lines into their respective events
groups = [lines[events[i]:events[i+1]] for i in range(len(events)-1)]
print(groups)
Output:
[['some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1'],
['different event', 'reading: 22.3', 'reading: 21.1', 'reading: 26.0', 'reading: 25.2'],
['another event', 'reading: 25.5', 'reading: 25.1']]
I'm not sure how much you gain in raw readability, but it's pretty straightforward to understand with the comments.
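The index bookkeeping can be tightened a little with enumerate and zip. The following is a sketch of the same slicing idea, not the answer's own code; the in-memory `lines` list is an assumption standing in for the file, and `starts` and `bounds` are names chosen for illustration.

```python
# Sample data standing in for the file contents (assumption).
lines = [
    'some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1',
    'different event', 'reading: 22.3', 'reading: 21.1', 'reading: 26.0', 'reading: 25.2',
    'another event', 'reading: 25.5', 'reading: 25.1',
]

# Indices of the lines that start a group.
starts = [i for i, line in enumerate(lines) if 'event' in line]

# Pair each start index with the next one; the last group ends at len(lines).
bounds = list(zip(starts, starts[1:] + [len(lines)]))

# Slice out one group per (begin, end) pair.
groups = [lines[begin:end] for begin, end in bounds]
```

As with the original version, any lines that appear before the first event are dropped, and the whole input has to be in memory because it is sliced by index.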
Upvotes: 2
Reputation: 353569
I wish itertools had a function which did what you wanted. For entertainment value, in modern Python you could do something like
from itertools import groupby, accumulate, tee

def splitter(source, fn):
    s0, s1 = tee(source)
    tick = accumulate(fn(line) for line in s1)
    grouped = groupby(s0, lambda x: next(tick))
    return (list(g) for k, g in grouped)
which gives
>>> with open("event.dat") as fp:
...     s = list(splitter(fp, lambda x: x.strip().endswith("event")))
...
>>> s
[['some event\n', 'reading: 25.4\n', 'reading: 23.4\n', 'reading: 25.1\n'],
['different event\n', 'reading: 22.3\n', 'reading: 21.1\n', 'reading: 26.0\n', 'reading: 25.2\n'],
['another event\n', 'reading: 25.5\n', 'reading: 25.1']]
but to be honest I'd probably do what @abarnert did.
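To see why this works, it may help to look at the running total on its own. In the sketch below (made-up sample data, with `int()` added only to make the values easier to read), every event line adds 1 to the total, so all lines belonging to one event share the same number, and that number is what groupby uses as the key.

```python
from itertools import accumulate

# Made-up sample data.
lines = ['some event', 'reading: 25.4', 'reading: 23.4', 'different event', 'reading: 22.3']

# 1 for each line that starts a group, 0 for every other line.
flags = [int(line.endswith('event')) for line in lines]

# Running total of the flags: it goes up by one at each event line.
ticks = list(accumulate(flags))

# Show which key each line would be grouped under.
keyed = list(zip(ticks, lines))
```

The tee in splitter is there because the source is read twice, once to compute the ticks and once to be grouped. The `lambda x: next(tick)` key appears to rely on groupby calling the key function once per element, in order.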
Upvotes: 4
Reputation: 9323
I doubt itertools (or collections) can make it clearer than this, unless the exact pattern is implemented in there somewhere.
Two things I notice:
current_event[0]
So you can skip checking whether you have a current event, and you don't have to special-case creating it either. Additionally, since the "current" event is always the last one, we can just use a negative index to jump straight to it:
events = []
for line in lines:
    if is_event(line):
        events.append([])
    events[-1].append(line)

def is_event(line):
    return 'event' in line
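One caveat: if the first line is not an event, `events` is still empty when `events[-1]` is evaluated, and the loop raises an IndexError. A guarded variant might look like the sketch below. `group_events` is a hypothetical wrapper name, and putting the leading lines into a group of their own is a design choice rather than something the answer specifies.

```python
def is_event(line):
    return 'event' in line

def group_events(lines):
    """Group lines under the most recent event line (hypothetical wrapper around the loop above)."""
    events = []
    for line in lines:
        if is_event(line) or not events:
            # Start a new group at each event, and also for any
            # lines that appear before the first event.
            events.append([])
        events[-1].append(line)
    return events

# Made-up sample data.
normal = group_events(['some event', 'reading: 25.4', 'different event', 'reading: 22.3'])
leading = group_events(['reading: 1.0', 'late event', 'reading: 2.0'])
```

When the input is known to start with an event, as in the question's file, the guard never fires and the behavior is the same as the shorter loop.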
Upvotes: 5