Joe Pinsonault

Reputation: 537

How to collect data from a list into groups based on condition?

Not sure how to title this question. I've run into a few situations where I have a list of data, maybe annotated with some property, and I want to collect them into groups.

For example, maybe I have a file like this:

some event
reading: 25.4
reading: 23.4
reading: 25.1
different event
reading: 22.3
reading: 21.1
reading: 26.0
reading: 25.2
another event
reading: 25.5
reading: 25.1

and I want to group each set of readings, splitting them on a condition (in this case, an event happening) so that I end up with a structure like

[['some event',
  'reading: 25.4',
  'reading: 23.4',
  'reading: 25.1'],
 ['different event',
  'reading: 22.3',
  'reading: 21.1',
  'reading: 26.0',
  'reading: 25.2'],
 ['another event',
  'reading: 25.5',
  'reading: 25.1']]

In its generic form, it is: look for a condition, collect the data until that condition is true again, repeat.

Right now, I'd do something like

events = []
current_event = []

for line in lines:
    if is_event(line):
        if current_event:
            events.append(current_event)
        current_event = [line]

    else:
        current_event.append(line)
else:
    if current_event:
        events.append(current_event)


def is_event(line):
    return 'event' in line

which produces what I want, but it's ugly and hard to understand. I'm fairly certain there has to be a better way.

My guess is that it involves some itertools wizardry, but I'm new to itertools and can't quite wrap my head around all of it.

Thanks!

Update

I've actually gone with Steve Jessop's answer with a Grouper class. Here's what I'm doing:

class Grouper(object):
    def __init__(self, condition_function):
        self.count = 0
        self.condition_function = condition_function

    def __call__(self, line):
        if self.condition_function(line):
            self.count += 1
        return self.count

and then using it like

event_grouper = Grouper(is_event)
result_as_iterators = (x[1] for x in itertools.groupby(lines, event_grouper))

and then to turn it into a list of dictionaries (this needs Python 3 for the starred unpacking) I do

event_dictionary = [{event: readings} for event, *readings in result_as_iterators]

which gives

[
 {'some event': ['reading: 25.4', 'reading: 23.4', 'reading: 25.1']},
 {'different event': ['reading: 22.3','reading: 21.1','reading: 26.0','reading: 25.2']},
 {'another event': ['reading: 25.5', 'reading: 25.1']}
]

Upvotes: 6

Views: 1109

Answers (5)

Steve Jessop

Reputation: 279445

You can use the fact that functions in Python are objects and can carry state in attributes. This grouper function serves the same purpose as DSM's accumulate(fn(line) for line in s1):

def grouper(line):
    if is_event(line):
        grouper.count += 1
    return grouper.count
grouper.count = 0

result_as_iterators = (x[1] for x in itertools.groupby(lines, grouper))

Then if you need it:

result_as_lists = [list(x) for x in result_as_iterators]

To allow for concurrent use you need a new grouper function object each time you use it (so that it has its own count). You might find it simpler to make it a class:

class Grouper(object):
    def __init__(self):
        self.count = 0
    def __call__(self, line):
        if is_event(line):
            self.count += 1
        return self.count

result_as_iterators = (x[1] for x in itertools.groupby(lines, Grouper()))
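
A closure is another way to give each use its own counter. This is only a minimal sketch, not part of the original answer: make_grouper is a hypothetical name, and the sample lines are taken from the question.

```python
import itertools

def make_grouper(condition):
    """Return a fresh key function with its own private counter (hypothetical helper)."""
    count = 0

    def key(line):
        nonlocal count
        if condition(line):
            count += 1  # a new group starts whenever the condition holds
        return count

    return key

# Sample data from the question
lines = ['some event', 'reading: 25.4', 'reading: 23.4',
         'different event', 'reading: 22.3']

# Each call to make_grouper gives an independent counter
key = make_grouper(lambda line: 'event' in line)
result_as_lists = [list(g) for _, g in itertools.groupby(lines, key)]
```

Because the counter lives inside the closure, two groupby calls running at the same time cannot interfere with each other, as long as each one gets its own make_grouper(...) result.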

Upvotes: 2

abarnert

Reputation: 366133

With itertools.groupby, you can easily group things based on a key, like 'event' in line. So, as a first step:

>>> for k, g in itertools.groupby(lines, lambda line: 'event' in line):
...     print(k, list(g))

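Of course this doesn't put the events together with their values. I suspect you really don't want the events together with their values, but would actually prefer to have a dict of event: [values] or a list of (event, [values]). In which case you're nearly done. For example, to get that dict, take just the groups (dropping the keys), use the grouper recipe (or zip(*[iter(groups)]*2)) to collect them into pairs, then use a dict comprehension to map each k, v in those pairs to next(k): list(v). One caveat: groupby invalidates a group as soon as the next one is requested, so in practice each group has to be turned into a list before pairing, and the key becomes k[0] rather than next(k).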

On the other hand, if you really do want them together, it's the same steps, but with a list of [next(k)] + list(v) at the end.
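
The steps described above can be sketched roughly as follows. This is one possible reading, not the answer's own code. It assumes the input starts with an event line, that every event is followed by at least one reading, and that no two event lines are adjacent. The groups are turned into lists up front, because a groupby group is no longer usable once the outer iterator has moved on.

```python
import itertools

# Sample data from the question
lines = ['some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1',
         'different event', 'reading: 22.3', 'reading: 21.1',
         'another event', 'reading: 25.5', 'reading: 25.1']

def is_event(line):
    return 'event' in line

# Materialize each group right away; the groups alternate between
# [event line] and [its readings] under the assumptions stated above.
groups = [list(g) for _, g in itertools.groupby(lines, is_event)]

# Pair consecutive groups: (['some event'], ['reading: 25.4', ...]), ...
event_dict = {k[0]: v for k, v in zip(*[iter(groups)] * 2)}

# Same pairing, but keeping each event together with its readings
event_lists = [k + v for k, v in zip(*[iter(groups)] * 2)]
```

If two events can share the same text, the dict form will keep only the last one, so the list form is the safer choice in that case.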

However, if you don't actually understand groupby well enough to turn that description into code, you should probably write something you do understand. And that's not too hard:

def groupify(lines):
    event = []
    for line in lines:
        if 'event' in line:
            if event: yield event
            event = [line]
        else:
            event.append(line)
    if event: yield event

Yes, it's 7 lines (condensable to 4 with some tricks) instead of 3 (condensable to 1 by nesting comprehensions in an ugly way), but 7 lines you understand and can debug are more useful than 3 lines of magic.

When you iterate the generator created by this function, it gives you lists of lines, like this:

>>> for event in groupify(lines):
...     print(event)

This will print:

['some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1']
['different event', 'reading: 22.3', 'reading: 21.1', 'reading: 26.0', 'reading: 25.2']
['another event', 'reading: 25.5', 'reading: 25.1']

If you want a list instead of a generator (so you can index it, or iterate over it twice), you can do the same thing you do to turn any other iterable into a list:

events = list(groupify(lines))

Upvotes: 5

mdml

Reputation: 22922

You can make your code more concise using list comprehensions:

# Load the file (the with block closes it again afterwards)
with open("test.txt") as f:
    lines = [l.rstrip() for l in f]

# Record the line indices where events start/stop
events = [i for i, line in enumerate(lines) if "event" in line]
events.append(len(lines))  # required to get the last event

# Group the lines into their respective events
groups = [lines[events[i]:events[i+1]] for i in range(len(events)-1)]
print(groups)

Output:

[['some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1'],
 ['different event', 'reading: 22.3', 'reading: 21.1', 'reading: 26.0', 'reading: 25.2'],
 ['another event', 'reading: 25.5', 'reading: 25.1']]

I'm not sure how much you gain in raw readability, but it's pretty straightforward to understand with the comments.

Upvotes: 2

DSM

Reputation: 353569

I wish itertools had a function which did what you wanted. For entertainment value, in modern Python you could do something like

from itertools import groupby, accumulate, tee
def splitter(source, fn):
    s0, s1 = tee(source)
    tick = accumulate(fn(line) for line in s1)
    grouped = groupby(s0, lambda x: next(tick))
    return (list(g) for k,g in grouped)

which gives

>>> with open("event.dat") as fp:
...     s = list(splitter(fp, lambda x: x.strip().endswith("event")))
...     
>>> s
[['some event\n', 'reading: 25.4\n', 'reading: 23.4\n', 'reading: 25.1\n'], 
['different event\n', 'reading: 22.3\n', 'reading: 21.1\n', 'reading: 26.0\n', 'reading: 25.2\n'], 
['another event\n', 'reading: 25.5\n', 'reading: 25.1']]

but to be honest I'd probably do what @abarnert did.
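
To see what the tick iterator inside splitter produces, the running count can be looked at on its own. This is a small sketch using the sample lines from the question and the 'event' in line test; the int(...) call is only there so the counts show up as plain integers.

```python
from itertools import accumulate

# Sample data from the question
lines = ['some event', 'reading: 25.4', 'reading: 23.4', 'reading: 25.1',
         'different event', 'reading: 22.3', 'reading: 21.1',
         'reading: 26.0', 'reading: 25.2',
         'another event', 'reading: 25.5', 'reading: 25.1']

# Running total of "is this line an event?": it goes up by one at each event line
ticks = list(accumulate(int('event' in line) for line in lines))
# ticks == [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
```

Every line in the same block gets the same number, which is exactly the kind of key groupby needs.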

Upvotes: 4

Izkata

Reputation: 9323

I doubt itertools (or collections) can make it clearer than this, unless the exact pattern is implemented in there somewhere.

Two things I notice:

  • You always have a current event (since the first line is an event)
  • You always append the line to the current event (so the event itself is always current_event[0])

So you can skip checking whether you have a current event, and you don't have to special-case creating it either. Additionally, since the "current" event is always the last one, we can just use a negative index to jump straight to it:

events = []

for line in lines:
    if is_event(line):
        events.append([])
    events[-1].append(line)

def is_event(line):
    return 'event' in line
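
The loop above relies on the first line being an event; if the input starts with a reading, events[-1] raises an IndexError on the empty list. Here is a minimal sketch of one way to guard against that, assuming leading readings should simply be collected into a group of their own (group_lines is a hypothetical name, not from the question):

```python
def is_event(line):
    return 'event' in line

def group_lines(lines):
    """Group lines by event, tolerating readings before the first event."""
    events = []
    for line in lines:
        # Start a new group at each event, or if no group exists yet
        if is_event(line) or not events:
            events.append([])
        events[-1].append(line)
    return events

grouped = group_lines(['reading: 20.0', 'some event', 'reading: 25.4'])
```

Whether stray leading readings should be kept, dropped, or treated as an error depends on the data, so the extra `or not events` check is just one option.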

Upvotes: 5
