KillerSnail

Reputation: 3591

split python list given the index start

I have looked at this: Split list into sublist based on index ranges

But my problem is slightly different. I have a list

List = ['2016-01-01', 'stuff happened', 'details', 
        '2016-01-02', 'more stuff happened', 'details', 'report']

I need to split it up into sublists based on the dates. Basically it's an event log, but due to shitty DB design the system concatenates the separate update messages for an event into one big list of strings. I have:

import re
date_regex_return_all = r"\d+-\d+-\d+"
Event_indices = [i for i, word in enumerate(List)
                 if re.match(date_regex_return_all, word)]

which for my example will give:

[0,3]

Now I need to split the list into separate lists based on those indexes. So for my example ideally I want to get:

[List[0], [List[1], List[2]]], [List[3], [List[4],  List[5], List[6]] ]

so the format is:

[event_date, [list of other text]], [event_date, [list of other text]]

There are also edge cases where the list has no date string at all, or does not start with one. Those should come out in this format:

Special_case = ['blah', 'blah', 'stuff']
Special_case_2 = ['blah', 'blah', '2015-01-01', 'blah', 'blah']

result_special_case = ['', [Special_case[0], Special_case[1],Special_case[2] ]]
result_special_case_2 = [ ['', [ Special_case_2[0], Special_case_2[1] ] ], 
                          [Special_case_2[2], [ Special_case_2[3],Special_case_2[4] ] ] ]

Upvotes: 2


Answers (2)

acw1668

Reputation: 46687

Try:

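import re

def split_by_date(arr, patt=r'\d+-\d+-\d+'):
    results = []
    srch = re.compile(patt)
    # Current record: [date, [texts]]; the date stays '' until one is seen
    rec = ['', []]
    for item in arr:
        if srch.match(item):
            # A new date closes the current record (skip the empty placeholder)
            if rec[0] or rec[1]:
                results.append(rec)
            rec = [item, []]
        else:
            rec[1].append(item)
    # Flush the last record
    if rec[0] or rec[1]:
        results.append(rec)
    return results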

Then:

normal_case = ['2016-01-01', 'stuff happened', 'details', 
               '2016-01-02', 'more stuff happened', 'details', 'report']
special_case_1 = ['blah', 'blah', 'stuff', '2016-11-11']
special_case_2 = ['blah', 'blah', '2015/01/01', 'blah', 'blah']

print(split_by_date(normal_case))
print(split_by_date(special_case_1))
print(split_by_date(special_case_2, r'\d+/\d+/\d+'))
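If you would rather keep the index list that the question already computes, slicing between consecutive indices gives the same shape. This is only a sketch: `split_at_indices`, `DATE_RE` and `log` are names made up for illustration, and it assumes the indices are sorted and point at the date strings.

```python
import re

DATE_RE = re.compile(r"\d+-\d+-\d+")  # same loose date pattern as in the question

def split_at_indices(items, date_indices):
    """Slice items into [date, [texts]] records at the given date positions."""
    records = []
    # Anything before the first date becomes a single undated record
    first = date_indices[0] if date_indices else len(items)
    if first > 0:
        records.append(['', items[:first]])
    # Pair each date index with the next one (or with the end of the list)
    bounds = date_indices[1:] + [len(items)]
    for start, stop in zip(date_indices, bounds):
        records.append([items[start], items[start + 1:stop]])
    return records

log = ['blah', 'blah', '2015-01-01', 'blah', 'blah']
indices = [i for i, word in enumerate(log) if DATE_RE.match(word)]
print(split_at_indices(log, indices))
# [['', ['blah', 'blah']], ['2015-01-01', ['blah', 'blah']]]
```

Unlike the loop above, this needs the whole list in memory, since it slices by position.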

Upvotes: 1

ShadowRanger

Reputation: 155428

You don't need a two-pass approach at all: itertools.groupby can segment the dates and their associated events in a single pass. Because it never computes indices or slices a list, it also works on a generator that provides the values one at a time, which avoids memory issues if your inputs are huge. To demonstrate, I've taken your original List and expanded it a bit to show that this handles the edge cases correctly:

import re

from itertools import groupby

List = ['undated', 'garbage', 'then', 'twodates', '2015-12-31',
        '2016-01-01', 'stuff happened', 'details', 
        '2016-01-02', 'more stuff happened', 'details', 'report',
        '2016-01-03']

datere = re.compile(r"\d+\-\d+\-\d+")  # Precompile regex for speed
def group_by_date(it):
    # Make iterator that groups consecutive dates together and consecutive non-dates together
    grouped = groupby(it, key=lambda x: datere.match(x) is not None)
    for isdate, g in grouped:
        if not isdate:
            # We had a leading set of undated events, output as undated
            yield ['', list(g)]
        else:
            # At least one date found; iterate with one loop delay
            # so final date can have events included (all others have no events)
            lastdate = next(g)
            for date in g:
                yield [lastdate, []]
                lastdate = date

            # Final date pulls next group (which must be events or the end of the input)
            try:
                # Get next group of events
                events = list(next(grouped)[1])
            except StopIteration:
                # There were no events for final date
                yield [lastdate, []]
            else:
                # There were events associated with final date
                yield [lastdate, events]

print(list(group_by_date(List)))

which outputs (newlines added for readability):

[['', ['undated', 'garbage', 'then', 'twodates']],
 ['2015-12-31', []],
 ['2016-01-01', ['stuff happened', 'details']],
 ['2016-01-02', ['more stuff happened', 'details', 'report']],
 ['2016-01-03', []]]
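To make the streaming point concrete, here is a minimal sketch of a plain-loop generator (no groupby) fed from another generator, so no full list is ever built. `iter_records`, `read_log` and `DATE_RE` are hypothetical names for illustration; `read_log` just stands in for whatever actually produces the strings.

```python
import re

DATE_RE = re.compile(r"\d+-\d+-\d+")  # same loose date pattern as above

def iter_records(items):
    """Lazily yield [date, [texts]] records from any iterable of strings."""
    date, texts = '', []
    for item in items:
        if DATE_RE.match(item):
            # A new date closes the previous record, unless nothing was seen yet
            if date or texts:
                yield [date, texts]
            date, texts = item, []
        else:
            texts.append(item)
    # Flush whatever is left once the input is exhausted
    if date or texts:
        yield [date, texts]

def read_log():
    # Stand-in for a real source such as a file or a DB cursor
    yield from ['undated', '2015-12-31', '2016-01-01', 'stuff happened']

for record in iter_records(read_log()):
    print(record)
# ['', ['undated']]
# ['2015-12-31', []]
# ['2016-01-01', ['stuff happened']]
```

Only the texts of the record currently being built are held in memory at any time.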

Upvotes: 1
