whackamadoodle3000

Reputation: 6748

Splitting a List Based on a Substring

I have the following list:

['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']

I want to split this list into multiple lists so that each sublist will have the substring "(Reg)" appear once:

[['1(Reg)', '100', '103', '102', '100'],
['2(Reg)', '98', '101', '100'],
['3(Reg)', '96', '99', '98'],
['4(Reg)', '100', '100', '100', '100'],
['5(Reg)', '98', '99', '99', '100'],
['6(Reg)', '99.47', '99.86', '99.67', '100']]

I've tried joining the list with a delimiter and splitting it by (Reg), but that didn't work. How can I split the list into a nested list like above?

Upvotes: 4

Views: 2254

Answers (8)

whackamadoodle3000

Reputation: 6748

Here's another way with no libraries. It is a list comprehension built off of DYZ's answer:

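# data is the input list; w collects the sublists.
w = []
# The comprehension runs only for its side effects on w; its own result (a list of None) is discarded.
[w.append([e]) if '(Reg)' in e else w[-1].append(e) for e in data]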

Upvotes: 1

RoadRunner

Reputation: 26315

You can also try this:

from itertools import groupby

lst = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100',
       '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100',
       '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']

grouped = [list(g) for k, g in groupby(lst, key = lambda x: x.endswith('(Reg)'))]

result = [x + y for x, y in zip(grouped[0::2], grouped[1::2])]

print(result)

Which outputs:

[['1(Reg)', '100', '103', '102', '100'], ['2(Reg)', '98', '101', '100'], ['3(Reg)', '96', '99', '98'], ['4(Reg)', '100', '100', '100', '100'], ['5(Reg)', '98', '99', '99', '100'], ['6(Reg)', '99.47', '99.86', '99.67', '100']]
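
To make the two steps easier to follow, here is a small sketch on a shortened, hypothetical sample (the name sample is not from the answer). It prints the intermediate grouped list before the runs are glued together:

```python
from itertools import groupby

# Shortened, hypothetical sample in the same shape as the question's list.
sample = ['1(Reg)', '100', '103', '2(Reg)', '98']

# groupby yields alternating runs: marker items, then the values after them.
grouped = [list(g) for _, g in groupby(sample, key=lambda x: x.endswith('(Reg)'))]
print(grouped)
# [['1(Reg)'], ['100', '103'], ['2(Reg)'], ['98']]

# Gluing each marker run to the run that follows it gives the sublists.
result = [x + y for x, y in zip(grouped[0::2], grouped[1::2])]
print(result)
# [['1(Reg)', '100', '103'], ['2(Reg)', '98']]
```

Note that this pairing assumes the list starts with a marker and that every marker is followed by at least one value. Two adjacent markers would land in the same run, and a trailing marker with no values would be dropped by zip.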

Upvotes: 1

Transhuman

Reputation: 3547

Using itertools.groupby

lst = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
from itertools import groupby
[a+b for a,b in zip(*([iter(list(g) for k, g in groupby(lst, lambda x:'Reg' in x))]*2))]

Output:

[['1(Reg)', '100', '103', '102', '100'],
 ['2(Reg)', '98', '101', '100'],
 ['3(Reg)', '96', '99', '98'],
 ['4(Reg)', '100', '100', '100', '100'],
 ['5(Reg)', '98', '99', '99', '100'],
 ['6(Reg)', '99.47', '99.86', '99.67', '100']]

Upvotes: 2

Pavel

Reputation: 7552

Ok, here's my take with super-simple standard list comprehensions (very similar to @jp_data_analysis's answer):

>>> from pprint import pprint
>>> d = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
>>> idx = [i for i in range(len(d)) if d[i].endswith("(Reg)")] + [len(d)]
>>> idx
[0, 5, 9, 13, 18, 23, 28]
>>> res = [d[idx[i-1]:idx[i]] for i in range(1,len(idx))]
>>> pprint(res)
[['1(Reg)', '100', '103', '102', '100'],
 ['2(Reg)', '98', '101', '100'],
 ['3(Reg)', '96', '99', '98'],
 ['4(Reg)', '100', '100', '100', '100'],
 ['5(Reg)', '98', '99', '99', '100'],
 ['6(Reg)', '99.47', '99.86', '99.67', '100']]

Explanation: idx holds the indices of every element ending in (Reg) (including the list length as the final element). Then the list res is defined via intervals between those elements.
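
The same interval idea can also be sketched by pairing each boundary with the next one via zip, which avoids the index arithmetic. The shortened sample list and the names start and stop are illustrative assumptions, not part of the session above:

```python
# Shortened, hypothetical sample in the same shape as the question's list.
d = ['1(Reg)', '100', '103', '2(Reg)', '98', '3(Reg)', '96', '99']

# Indices of the marker elements, plus len(d) as a closing sentinel.
idx = [i for i, s in enumerate(d) if s.endswith('(Reg)')] + [len(d)]
print(idx)
# [0, 3, 5, 8]

# Pair each boundary with the next one and slice between them.
res = [d[start:stop] for start, stop in zip(idx, idx[1:])]
print(res)
# [['1(Reg)', '100', '103'], ['2(Reg)', '98'], ['3(Reg)', '96', '99']]
```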

On a philosophical note: every time you face a problem like this, ask yourself: how did I get here? Why do I need to deal with some super-fragile implicit-string-format-rules instead of a real data structure? One that takes intervals and data hierarchy into account? One that enforces limitations by design and allows for simple querying? Find someone to blame and rant about them on Twitter :)

Upvotes: 4

jpp

Reputation: 164673

Here is one way, though not necessarily optimal:

from itertools import zip_longest

lst = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100',
       '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100',
       '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']

indices = [i for i, j in enumerate(lst) if '(Reg)' in j]
lst_new = [lst[i:j] for i, j in zip_longest(indices, indices[1:])]

# [['1(Reg)', '100', '103', '102', '100'],
#  ['2(Reg)', '98', '101', '100'],
#  ['3(Reg)', '96', '99', '98'],
#  ['4(Reg)', '100', '100', '100', '100'],
#  ['5(Reg)', '98', '99', '99', '100'],
#  ['6(Reg)', '99.47', '99.86', '99.67', '100']]
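
A short sketch of why zip_longest is used here rather than zip: it pads the shorter sequence with None, and a slice that ends at None runs to the end of the list, so the last group is not lost. The index values and the sample list below are made up for illustration:

```python
from itertools import zip_longest

# Hypothetical marker positions in a 7-element list.
indices = [0, 3, 5]

# The final pair is padded with None because indices[1:] is one item shorter.
pairs = list(zip_longest(indices, indices[1:]))
print(pairs)
# [(0, 3), (3, 5), (5, None)]

# Slicing up to None is the same as slicing to the end.
sample = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
tail = sample[5:None]
print(tail)
# ['f', 'g']
```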

Upvotes: 5

Ajax1234

Reputation: 71451

You can use itertools.groupby with regular expressions:

import itertools
import re
s = ['1(Reg)', '100', '103', '102', '100', '2(Reg)', '98', '101', '100', '3(Reg)', '96', '99', '98', '4(Reg)', '100', '100', '100', '100', '5(Reg)', '98', '99', '99', '100', '6(Reg)', '99.47', '99.86', '99.67', '100']
new_data = [list(b) for _, b in itertools.groupby(s, key=lambda x:bool(re.findall(r'\d+\(', x)))]
final_data = [new_data[i]+new_data[i+1] for i in range(0, len(new_data), 2)]

Output:

[['1(Reg)', '100', '103', '102', '100'], 
 ['2(Reg)', '98', '101', '100'], 
 ['3(Reg)', '96', '99', '98'], 
 ['4(Reg)', '100', '100', '100', '100'], 
 ['5(Reg)', '98', '99', '99', '100'], 
 ['6(Reg)', '99.47', '99.86', '99.67', '100']]

Upvotes: 5

DYZ

Reputation: 57033

A slightly different (optimized) version of WVO's answer:

splitted = []

for item in l:
    if '(Reg)' in item:
        splitted.append([])
    splitted[-1].append(item)

#[['1(Reg)', '100', '103', '102', '100'], ['2(Reg)', '98', '101', '100'], 
# ['3(Reg)', '96', '99', '98'], ['4(Reg)', '100', '100', '100', '100'], 
# ['5(Reg)', '98', '99', '99', '100'], 
# ['6(Reg)', '99.47', '99.86', '99.67', '100']]
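
One caveat: the loop assumes the first element contains the marker. If it does not, splitted[-1] is evaluated on an empty list and raises IndexError. A possible variant that tolerates leading non-marker items is sketched below; the function name split_on_marker and its marker parameter are hypothetical, not part of the answer above:

```python
def split_on_marker(items, marker='(Reg)'):
    """Group items so that each marker item starts a new sublist."""
    groups = []
    for item in items:
        # Start a new group at every marker, and also at the very first
        # item so that leading non-marker items get a group of their own.
        if marker in item or not groups:
            groups.append([])
        groups[-1].append(item)
    return groups


print(split_on_marker(['1(Reg)', '100', '2(Reg)', '98']))
# [['1(Reg)', '100'], ['2(Reg)', '98']]
print(split_on_marker(['stray', '1(Reg)', '100']))
# [['stray'], ['1(Reg)', '100']]
```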

Upvotes: 6

willeM_ Van Onsem

Reputation: 476614

We can use a for loop for this with two lists: one to build the current row, and the other to store all rows collected so far. Like:

rows = []
row = []
for word in data:
    if '(Reg)' in word:
        rows.append(row)
        row = []
    row.append(word)
rows.append(row)

where data is the initial list of strings.

There is a problem with this, however: it will first add an empty row (given that the first element has (Reg) in it). We can prevent this by only adding non-empty rows, like:

rows = []
row = []
for word in data:
    if '(Reg)' in word:
        if row:
            rows.append(row)
        row = []
    row.append(word)
if row:
    rows.append(row)

We can generalize the above into a dedicated function:

def split_at(data, predicate, with_empty=False):
    rows = []
    row = []
    for word in data:
        if predicate(word):
            if with_empty or row:
                rows.append(row)
            row = []
        row.append(word)
    if with_empty or row:
        rows.append(row)
    return rows

We can then call it like:

split_at(our_list, lambda x: '(Reg)' in x)
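
As a quick check of how the with_empty flag behaves, here is a self-contained sketch that repeats the function and runs it on a shortened, hypothetical our_list:

```python
def split_at(data, predicate, with_empty=False):
    rows = []
    row = []
    for word in data:
        if predicate(word):
            # Close the current row; skip empty rows unless with_empty is set.
            if with_empty or row:
                rows.append(row)
            row = []
        row.append(word)
    if with_empty or row:
        rows.append(row)
    return rows


# Shortened, hypothetical sample in the same shape as the question's list.
our_list = ['1(Reg)', '100', '103', '2(Reg)', '98', '101']

print(split_at(our_list, lambda x: '(Reg)' in x))
# [['1(Reg)', '100', '103'], ['2(Reg)', '98', '101']]

print(split_at(our_list, lambda x: '(Reg)' in x, with_empty=True))
# [[], ['1(Reg)', '100', '103'], ['2(Reg)', '98', '101']]
```

With with_empty=True the leading empty row is kept, which is the behaviour of the first loop shown earlier in this answer.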

Upvotes: 2
