Reputation: 464
The problem is as follows. I have a list of strings
lst1=['puffing','his','first','cigarette','in', 'weeks', 'in', 'weeks']
and I would like to obtain the list
lst2=['puffing','his','first','cigarette','in weeks', 'in weeks']
that is, to concatenate every occurrence of the sublist ['in', 'weeks']
(for reasons that are irrelevant here). My attempt is below, where find_sub_list1
is borrowed from elsewhere (and included in the code below):
npis = [['in', 'weeks'], ['in', 'ages']]

# given a list and a candidate sublist, return the index of the first and last
# element of the sublist within the list
def find_sub_list1(sl,l):
    results=[]
    sll=len(sl)
    for ind in (i for i,e in enumerate(l) if e==sl[0]):
        if l[ind:ind+sll]==sl:
            results.append((ind,ind+sll-1))
    return results

def concatenator(sent, npis):
    indices = []
    for npi in npis:
        indices_temp = find_sub_list1(npi, sent)
        if indices_temp != []:
            indices.extend(indices_temp)
    sorted(indices, key=lambda x: x[0])
    for (a,b) in indices:
        diff = b - a
        sent[a:b+1] = [" ".join(sent[a:b+1])]
        del indices[0]
        indices = [(a - diff, b - diff) for (a,b) in indices]
    return sent
Instead of the desired lst2, this code returns:
concatenator(lst1, npis)
>>['puffing','his','first','cigarette','in weeks', 'in', 'weeks']
so it only concatenates the first occurrence. Any ideas about where the code is failing?
Upvotes: 6
Views: 251
Reputation: 28243
Since the desired sub-sequence is 'in' 'weeks' (and possibly 'in' 'ages'), one possible solution could be the following (the looping is not very elegant, though):
First, find all positions where 'in' occurs.
Then iterate through the source list, appending elements to the target list and treating the positions of 'in' specially: if the following word is in a special set, join the two, append the result to the target, and advance the iterator one extra time.
Once the source list is exhausted, an IndexError will be raised, indicating that we should break out of the loop.
code:
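# positions of every 'in' in the source list
index_in = [i for i, word in enumerate(lst1) if word == 'in']
lst2 = []; n = 0
while True:
    try:
        # the n + 1 bound check keeps a trailing 'in' from being dropped
        if n in index_in and n + 1 < len(lst1) and lst1[n+1] in ['weeks', 'ages']:
            # join the pair with a space and skip the second word
            lst2.append(lst1[n] + ' ' + lst1[n+1])
            n += 1
        else:
            lst2.append(lst1[n])
        n += 1
    except IndexError:
        # lst1[n] is past the end: the source list is exhausted
        break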
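The same single-pass idea can be sketched without the precomputed index list or the IndexError, and with the phrases passed in as the npis list from the question. This is only an illustration: merge_phrases is a hypothetical name, and it assumes each phrase is a non-empty list of words.

```python
def merge_phrases(words, phrases):
    """Return a new list in which every occurrence of a phrase is joined with spaces.

    merge_phrases is a hypothetical helper; `phrases` is assumed to be a list
    of word lists, like npis in the question.
    """
    merged = []
    n = 0
    while n < len(words):
        for phrase in phrases:
            # skip empty phrases so that n always advances
            if phrase and words[n:n + len(phrase)] == phrase:
                merged.append(" ".join(phrase))
                n += len(phrase)  # jump past the whole phrase
                break
        else:
            # no phrase starts at position n: copy the word unchanged
            merged.append(words[n])
            n += 1
    return merged


lst1 = ['puffing', 'his', 'first', 'cigarette', 'in', 'weeks', 'in', 'weeks']
npis = [['in', 'weeks'], ['in', 'ages']]
print(merge_phrases(lst1, npis))
# ['puffing', 'his', 'first', 'cigarette', 'in weeks', 'in weeks']
```

Because a new list is built rather than editing the input in place, no index bookkeeping is needed after each merge.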
A better way to do this would be through regular expressions:
join the list into a string with a space as the separator, then
split the string on spaces, except for a space that sits between in and weeks. Here, we can use negative lookbehind & lookahead.
code:
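import re
# split on a space unless it is preceded by 'in' AND followed by 'weeks';
# the alternation is needed so that e.g. 'in the' or 'two weeks' still split
c = re.compile(r'(?<!\bin) | (?!weeks\b)')
lst2 = c.split(' '.join(lst1))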
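As a quick check of the lookaround idea on input other than lst1, here is a small sketch. The helper name split_keeping_phrase and its parameters are assumptions made for illustration. A space should survive only when it is both preceded by the first word and followed by the second, so the pattern splits whenever either lookaround fails.

```python
import re


def split_keeping_phrase(words, first, second):
    """Split the joined words on spaces, except the space inside 'first second'.

    Hypothetical helper: `first` and `second` are the two words of one phrase.
    """
    text = " ".join(words)
    # left branch: the space is not preceded by the whole word `first`
    # right branch: the space is not followed by the whole word `second`
    pattern = re.compile(
        r"(?<!\b{0}) | (?!{1}\b)".format(re.escape(first), re.escape(second))
    )
    return pattern.split(text)


print(split_keeping_phrase(['he', 'is', 'in', 'the', 'house', 'in', 'weeks'], 'in', 'weeks'))
# ['he', 'is', 'in', 'the', 'house', 'in weeks']
```

Note that 'in the' is still split here, because the space after 'in' is not followed by 'weeks'.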
Upvotes: 2
Reputation: 4279
This isn't a fix for your code, but an alternative solution (I always end up using regex for everything):
import re
list1_str = ','.join(lst1)
npis_concat = [','.join(x) for x in npis]
for item in npis_concat:
    list1_str = re.sub(r'\b'+item+r'\b',item.replace(',', ' '),list1_str)
lst1 = list1_str.split(',')
I use a comma here, but you can replace it with any character, preferably one you know won't be in your text.
The r'\b' anchors are used to make sure we don't accidentally chop off bits from words ending/beginning with stuff in npis.
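To see what the word boundaries protect against, here is a small comparison sketch. The helper join_phrase and its use_boundaries flag are hypothetical, added only to run the same substitution with and without r'\b'.

```python
import re


def join_phrase(words, phrase, use_boundaries=True):
    """Join one phrase inside a word list via a comma-separated string.

    Hypothetical helper mirroring the approach above for a single phrase.
    """
    text = ",".join(words)
    target = ",".join(phrase)
    pattern = re.escape(target)
    if use_boundaries:
        # only match when the phrase starts and ends at a word boundary
        pattern = r"\b" + pattern + r"\b"
    text = re.sub(pattern, " ".join(phrase), text)
    return text.split(",")


words = ["begin", "weeks", "in", "weeks"]
print(join_phrase(words, ["in", "weeks"], use_boundaries=False))
# ['begin weeks', 'in weeks']
print(join_phrase(words, ["in", "weeks"]))
# ['begin', 'weeks', 'in weeks']
```

Without the boundaries, the tail of 'begin' is treated as the word 'in' and wrongly merged with the following 'weeks'.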
Upvotes: 0