Orest Xherija

Reputation: 464

Concatenating selected strings in list of strings

The problem is as follows. I have a list of strings

lst1=['puffing','his','first','cigarette','in', 'weeks', 'in', 'weeks']

and I would like to obtain the list

lst2=['puffing','his','first','cigarette','in weeks', 'in weeks']

that is, to concatenate every occurrence of the sublist ['in', 'weeks'] (for reasons that are irrelevant here). My attempt is below; find_sub_list1 is taken from another source and is included in the code:

npis = [['in', 'weeks'], ['in', 'ages']]

# given a candidate sublist and a list, return the indices of the first and
# last element of each occurrence of the sublist within the list
def find_sub_list1(sl,l):
    results=[]
    sll=len(sl)
    for ind in (i for i,e in enumerate(l) if e==sl[0]):
        if l[ind:ind+sll]==sl:
            results.append((ind,ind+sll-1))

    return results

def concatenator(sent, npis):
    indices = []
    for npi in npis:
        indices_temp = find_sub_list1(npi, sent)
        if indices_temp != []:
            indices.extend(indices_temp)
    sorted(indices, key=lambda x: x[0])

    for (a,b) in indices:
        diff = b - a
        sent[a:b+1] = [" ".join(sent[a:b+1])]
        del indices[0]
        indices = [(a - diff, b - diff) for (a,b) in indices]

    return sent 

Instead of the desired lst2, this code returns:

concatenator(lst1, npis)
>>['puffing','his','first','cigarette','in weeks', 'in', 'weeks']

So it only concatenates the first occurrence. Any ideas about where the code is failing?

Upvotes: 6

Views: 251

Answers (2)

Haleemur Ali

Reputation: 28243

Since the desired sub-sequence is 'in' 'weeks', and possibly 'in' 'ages', one possible solution (the looping is not very elegant, though) is:

  1. first find all positions where 'in' occurs.

  2. then iterate through the source list, appending elements to the target list and treating the positions of 'in' specially: if the following word is in a special set, join the two, append the result to the target, and advance the iterator one extra time.

  3. Once the source list is exhausted, an IndexError is raised, indicating that we should break out of the loop.

code:

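# positions of every 'in' in the source list
index_in = [i for i, word in enumerate(lst1) if word == 'in']

lst2 = []; n = 0

while True:
    try:
        # the bounds check keeps a trailing 'in' from being dropped
        if n in index_in and n + 1 < len(lst1) and lst1[n+1] in ['weeks', 'ages']:
            # join with a space to get 'in weeks' rather than 'inweeks'
            lst2.append(lst1[n] + ' ' + lst1[n+1])
            n += 1
        else:
            lst2.append(lst1[n])
        n += 1
    except IndexError:
        # lst1[n] ran past the end of the list: we are done
        break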
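The same scan can be written without relying on IndexError and without hard-coding the words. This is only a sketch: the function name merge_pairs and its parameters are made up for illustration, and it assumes every entry of npis has exactly two words.

```python
def merge_pairs(words, pairs):
    """Return a new list in which each adjacent pair listed in `pairs`
    is joined into one space-separated string."""
    wanted = {tuple(p) for p in pairs}
    merged = []
    n = 0
    while n < len(words):
        # Look at the current word together with the one after it.
        if tuple(words[n:n + 2]) in wanted:
            merged.append(' '.join(words[n:n + 2]))
            n += 2  # skip both words of the pair
        else:
            merged.append(words[n])
            n += 1
    return merged


lst1 = ['puffing', 'his', 'first', 'cigarette', 'in', 'weeks', 'in', 'weeks']
npis = [['in', 'weeks'], ['in', 'ages']]
print(merge_pairs(lst1, npis))
# ['puffing', 'his', 'first', 'cigarette', 'in weeks', 'in weeks']
```

Unlike the question's concatenator, this builds a new list and leaves lst1 untouched.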

A better way to do this would be through regular expressions.

  1. join the list into a string with a space as the separator

  2. split the string on spaces, except those spaces surrounded by in<space>weeks. Here, we can use negative lookahead & lookbehind

code:

import re

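# split on a space unless it is both preceded by the word 'in' and followed by the word 'weeks'
c = re.compile(r'(?<!\bin) | (?!weeks\b)')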

lst2 = c.split(' '.join(lst1))
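As for why the question's concatenator stops after the first match: `del indices[0]` shrinks the very list the for loop is iterating over, so the loop ends after one pass, and the list built on the next line is a new object the loop never sees. Separately, `sorted(...)` returns a new list that is thrown away. One possible fix, sketched below on the assumption that matches never overlap, is to replace from the right so the indices still to be processed stay valid. find_sub_list1 is repeated from the question only to keep the block self-contained.

```python
def find_sub_list1(sl, l):
    results = []
    sll = len(sl)
    for ind in (i for i, e in enumerate(l) if e == sl[0]):
        if l[ind:ind + sll] == sl:
            results.append((ind, ind + sll - 1))
    return results


def concatenator(sent, npis):
    indices = []
    for npi in npis:
        indices.extend(find_sub_list1(npi, sent))
    # Go through the matches right to left: replacing a slice only shifts
    # the positions after it, so the earlier indices remain correct.
    for a, b in sorted(indices, reverse=True):
        sent[a:b + 1] = [" ".join(sent[a:b + 1])]
    return sent


lst1 = ['puffing', 'his', 'first', 'cigarette', 'in', 'weeks', 'in', 'weeks']
npis = [['in', 'weeks'], ['in', 'ages']]
print(concatenator(lst1, npis))
# ['puffing', 'his', 'first', 'cigarette', 'in weeks', 'in weeks']
```

Like the original, this modifies sent in place; pass a copy (for example list(lst1)) if the input list should be preserved.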

Upvotes: 2

Nullman

Reputation: 4279

This isn't a fix for your code, but an alternative solution (I always end up using regex for everything):

import re
list1_str = ','.join(lst1)
npis_concat = [','.join(x) for x in npis]
for item in npis_concat:
    list1_str = re.sub(r'\b'+item+r'\b',item.replace(',', ' '),list1_str)
lst1 = list1_str.split(',')

I use a comma here, but you can replace it with any character, preferably one you know won't be in your text.

The r'\b' anchors make sure we don't accidentally chop off bits from words ending or beginning with stuff in npis.
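To see what the word boundaries buy you, here is a small comparison on a made-up input in which 'begin' happens to end with 'in'. The sample words are chosen only for illustration.

```python
import re

text = ','.join(['begin', 'weeks', 'in', 'weeks'])
item = 'in,weeks'

# Without \b the pattern also matches the tail of "begin" plus "weeks".
no_boundary = re.sub(item, item.replace(',', ' '), text)
# With \b only the standalone words "in" and "weeks" are merged.
with_boundary = re.sub(r'\b' + item + r'\b', item.replace(',', ' '), text)

print(no_boundary.split(','))    # ['begin weeks', 'in weeks']
print(with_boundary.split(','))  # ['begin', 'weeks', 'in weeks']
```

If the entries in npis could contain regex metacharacters, wrapping item in re.escape before building the pattern would be a sensible precaution.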

Upvotes: 0
