Reputation: 3591
I have looked at this: Split list into sublist based on index ranges
But my problem is slightly different. I have a list
List = ['2016-01-01', 'stuff happened', 'details',
'2016-01-02', 'more stuff happened', 'details', 'report']
I need to split it into sublists based on the dates. It's basically an event log, but due to shitty DB design the system concatenates the separate update messages for an event into one big list of strings. I have:
import re
date_regex = r"\d+-\d+-\d+"
Event_indices = [i for i, word in enumerate(List)
                 if re.match(date_regex, word)]
which for my example will give:
[0,3]
Now I need to split the list into separate lists based on those indices. For my example, ideally I want to get:
[List[0], [List[1], List[2]]], [List[3], [List[4], List[5], List[6]] ]
so the format is:
[event_date, [list of other text]], [event_date, [list of other text]]
There are also edge cases where the list contains no date string, or does not start with one. In those cases the date slot should be an empty string:
Special_case = ['blah', 'blah', 'stuff']
Special_case_2 = ['blah', 'blah', '2015-01-01', 'blah', 'blah']
result_special_case = ['', [Special_case[0], Special_case[1],Special_case[2] ]]
result_special_case_2 = [ ['', [ Special_case_2[0], Special_case_2[1] ] ],
[Special_case_2[2], [ Special_case_2[3],Special_case_2[4] ] ] ]
Upvotes: 2
Views: 512
Reputation: 46687
Try:
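import re

def split_by_date(arr, patt=r'\d+-\d+-\d+'):
    results = []
    srch = re.compile(patt)
    rec = ['', []]  # current record: [date, [other text]]
    for item in arr:
        if srch.match(item):
            # A new date starts a new record; flush the previous one if non-empty
            if rec[0] or rec[1]:
                results.append(rec)
            rec = [item, []]
        else:
            rec[1].append(item)
    # Flush the last record
    if rec[0] or rec[1]:
        results.append(rec)
    return results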
Then:
normal_case = ['2016-01-01', 'stuff happened', 'details',
               '2016-01-02', 'more stuff happened', 'details', 'report']
special_case_1 = ['blah', 'blah', 'stuff', '2016-11-11']
special_case_2 = ['blah', 'blah', '2015/01/01', 'blah', 'blah']
print(split_by_date(normal_case))
print(split_by_date(special_case_1))
print(split_by_date(special_case_2, r'\d+/\d+/\d+'))
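If you would rather keep the index list from the question, slicing at those indices is another option. The following is only a minimal sketch: the name split_at_indices is hypothetical, and it assumes the same date pattern as above and a plain list as input (it needs slicing, so it will not work on a generator).

```python
import re

def split_at_indices(items, pattern=r'\d+-\d+-\d+'):
    """Split items into [date, [other text]] records using date indices."""
    date_re = re.compile(pattern)
    # Indices of every element that starts with a date-like string
    starts = [i for i, word in enumerate(items) if date_re.match(word)]
    records = []
    # Anything before the first date (or everything, if there is no date) is undated
    first = starts[0] if starts else len(items)
    if first > 0:
        records.append(['', items[:first]])
    # Pair each date index with the next one (or with the end of the list)
    for begin, end in zip(starts, starts[1:] + [len(items)]):
        records.append([items[begin], items[begin + 1:end]])
    return records
```

This makes two passes (one to find the indices, one to slice), so the single-loop version above is probably the simpler choice unless you already need the indices for something else.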
Upvotes: 1
Reputation: 155428
You don't need a two-pass approach at all: itertools.groupby can separate the dates from their associated events in a single pass. Because this avoids computing indices and then slicing the list with them, it also works on a generator that provides the values one at a time, which avoids memory issues if your inputs are huge. To demonstrate, I've taken your original List and expanded it a bit to show that this handles the edge cases correctly:
import re
from itertools import groupby
List = ['undated', 'garbage', 'then', 'twodates', '2015-12-31',
        '2016-01-01', 'stuff happened', 'details',
        '2016-01-02', 'more stuff happened', 'details', 'report',
        '2016-01-03']
datere = re.compile(r"\d+\-\d+\-\d+")  # Precompile regex for speed
def group_by_date(it):
    # Make iterator that groups dates with dates and non-dates with non-dates
    grouped = groupby(it, key=lambda x: datere.match(x) is not None)
    for isdate, g in grouped:
        if not isdate:
            # We had a leading set of undated events, output as undated
            yield ['', list(g)]
        else:
            # At least one date found; iterate with one loop delay
            # so final date can have events included (all others have no events)
            lastdate = next(g)
            for date in g:
                yield [lastdate, []]
                lastdate = date
            # Final date pulls next group (which must be events or the end of the input)
            try:
                # Get next group of events
                events = list(next(grouped)[1])
            except StopIteration:
                # There were no events for final date
                yield [lastdate, []]
            else:
                # There were events associated with final date
                yield [lastdate, events]

print(list(group_by_date(List)))
which outputs (newlines added for readability):
[['', ['undated', 'garbage', 'then', 'twodates']],
['2015-12-31', []],
['2016-01-01', ['stuff happened', 'details']],
['2016-01-02', ['more stuff happened', 'details', 'report']],
['2016-01-03', []]]
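To see why the generator can assume that date groups and event groups alternate, it may help to look at what groupby yields on its own with the same key. This is just an illustrative sketch; date_runs, date_re and sample are hypothetical names that are not part of the code above.

```python
import re
from itertools import groupby

date_re = re.compile(r"\d+-\d+-\d+")

def date_runs(items):
    """Return (is_date, run) pairs, one per consecutive run of dates or non-dates."""
    return [(is_date, list(run))
            for is_date, run in groupby(items, key=lambda s: date_re.match(s) is not None)]

sample = ['note', '2015-12-31', '2016-01-01', 'stuff happened', 'details']
# iter(...) shows that a one-shot iterator works as input; no indexing is needed
runs = date_runs(iter(sample))
print(runs)
# [(False, ['note']), (True, ['2015-12-31', '2016-01-01']), (False, ['stuff happened', 'details'])]
```

Because groupby starts a new group only when the key changes, a True group is always followed by a False group or by the end of the input, which is what lets group_by_date pull the next group as the events of the last date in a run.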
Upvotes: 1