Reputation: 15
I have this code, inspired by others, which now successfully merges items starting with '##' into the previous item in the list. However, I get weird behaviour: the last item disappears.
List:
tokens = ['Hello', 'this', 'is', 'a', 's', '##e', '##ntenc', '##e']
Checking whether something is a subtoken (starts with ##):
def is_subtoken(string):
    if string[:2] == "##":
        return True
    else:
        return False
Merging the tokens:
merged_text = []
for i in range(len(tokens)):
    if not is_subtoken(tokens[i]) and (i+1)<len(tokens) and is_subtoken(tokens[i+1]):
        merged_text.append(tokens[i] + tokens[i+1][2:])
        if (i+2)<len(tokens) and is_subtoken(tokens[i+2]):
            merged_text[-1] = merged_text[-1] + tokens[i+2][2:]
    elif not is_subtoken(tokens[i]):
        merged_text.append(tokens[i])
print(merged_text)
This is the output:
['Hello', 'this', 'is', 'a', 'sentenc']
Whereas I expected:
['Hello', 'this', 'is', 'a', 'sentence']
I can't get my head around it. Is something missing that's needed to merge a run of several of these '##' items?
Thank you very much.
Upvotes: 0
Views: 207
Reputation: 189297
Your processing seems more complex than it needs to be.
merged = []
for token in tokens:
    if token.startswith('##') and merged:
        merged[-1] += token[2:]
    else:
        merged.append(token)
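For reuse, the loop above can be wrapped in a function. This is just a minimal sketch; the name `merge_subtokens` is made up for illustration and does not come from any tokenizer library. Because each '##' piece is glued onto whatever is currently last in the result, it should handle any number of consecutive pieces:

```python
def merge_subtokens(tokens):
    """Glue each '##'-prefixed piece onto the token before it.

    Hypothetical helper name; a '##' piece at the very start of the
    list has nothing to attach to, so it is kept unchanged.
    """
    merged = []
    for token in tokens:
        if token.startswith('##') and merged:
            merged[-1] += token[2:]  # drop the '##' marker and append
        else:
            merged.append(token)
    return merged


tokens = ['Hello', 'this', 'is', 'a', 's', '##e', '##ntenc', '##e']
print(merge_subtokens(tokens))
# ['Hello', 'this', 'is', 'a', 'sentence']
```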
Upvotes: 1
Reputation: 9946
you could just use join, replace, and split pretty easily (this assumes no token contains the '|' separator):
'|'.join(tokens).replace('|##', '').split('|')
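edit: you're missing the last element because your loop only ever looks at the next two items (i+1 and i+2), and a subtoken is never added on its own, so a third consecutive '##' piece is dropped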
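If you want to keep the look-ahead style of the question's loop, one possible fix is to replace the fixed `i+1` / `i+2` checks with a `while` loop that keeps consuming pieces for as long as they start with '##'. This is only a sketch of that idea, not the original code; `merge_lookahead` is a hypothetical name:

```python
def is_subtoken(token):
    return token.startswith("##")


def merge_lookahead(tokens):
    """Look-ahead merge that accepts any number of consecutive '##' pieces."""
    merged_text = []
    for i, token in enumerate(tokens):
        if is_subtoken(token):
            # Already absorbed by an earlier word. A leading '##' piece with
            # no word before it is dropped, as in the question's loop.
            continue
        word = token
        j = i + 1
        # Keep absorbing pieces until the next full token (or the end).
        while j < len(tokens) and is_subtoken(tokens[j]):
            word += tokens[j][2:]
            j += 1
        merged_text.append(word)
    return merged_text


tokens = ['Hello', 'this', 'is', 'a', 's', '##e', '##ntenc', '##e']
print(merge_lookahead(tokens))
# ['Hello', 'this', 'is', 'a', 'sentence']
```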
Upvotes: 2