Diederik

Reputation: 15

How to merge items in a list, based on the first two characters of the next item in the list

I have this code, adapted from other examples, which successfully merges items starting with '##' into the previous item in the list. However, I get odd behaviour: the last item disappears.

List:

tokens = ['Hello', 'this', 'is', 'a', 's', '##e', '##ntenc', '##e']

Checking whether an item is a subtoken (i.e. it starts with '##'):

def is_subtoken(string):
    # A subtoken is any item that starts with the '##' prefix.
    return string.startswith("##")

Merging the tokens:

merged_text = []
for i in range(len(tokens)):
    if not is_subtoken(tokens[i]) and (i+1)<len(tokens) and is_subtoken(tokens[i+1]):
        merged_text.append(tokens[i] + tokens[i+1][2:])
        if (i+2)<len(tokens) and is_subtoken(tokens[i+2]):
            merged_text[-1] = merged_text[-1] + tokens[i+2][2:]
    elif not is_subtoken(tokens[i]):
        merged_text.append(tokens[i])

print(merged_text)

This is the output:

['Hello', 'this', 'is', 'a', 'sentenc']

Whereas I expected:

['Hello', 'this', 'is', 'a', 'sentence']

I can't get my head around it. What is missing to merge an arbitrary number of consecutive '##' items?

Thank you very much.

Upvotes: 0

Views: 207

Answers (2)

tripleee

Reputation: 189297

Your processing seems more complex than it needs to be.

merged = []
for token in tokens:
    if token.startswith('##') and merged:
        merged[-1] += token[2:]
    else:
        merged.append(token)

Upvotes: 1

acushner

Reputation: 9946

You could just use join, replace, and split, as long as '|' never appears inside a token:

'|'.join(tokens).replace('|##', '').split('|')
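
If a token might itself contain '|', the one-liner would split it in the wrong place. A possible workaround, sketched here under the assumption that the NUL character never occurs in the tokens, is to use that as the separator. The names `merge_with_join` and `SEP` are made up for illustration.

```python
SEP = "\x00"  # assumed never to occur inside a token


def merge_with_join(tokens, sep=SEP):
    """Join, delete every 'separator + ##' pair, then split again."""
    if not tokens:
        # ''.split(sep) would return [''], so handle the empty list explicitly.
        return []
    return sep.join(tokens).replace(sep + "##", "").split(sep)


tokens = ['Hello', 'this', 'is', 'a', 's', '##e', '##ntenc', '##e']
result = merge_with_join(tokens)
print(result)  # ['Hello', 'this', 'is', 'a', 'sentence']

# With '|' as the separator, a token containing '|' gets split apart:
pipe_result = '|'.join(['a|b', '##c']).replace('|##', '').split('|')
print(pipe_result)  # ['a', 'bc']
print(merge_with_join(['a|b', '##c']))  # ['a|bc']
```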

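Edit: you're missing the last element because a subtoken is never appended on its own; it is only picked up when the loop looks ahead from a non-subtoken, and that look-ahead stops after two items (i+1 and i+2). A third consecutive '##' item is therefore dropped.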

Upvotes: 2
