Reputation: 324
Is there any way to combine BIO-tagged tokens into compound words? I implemented the method below to form words from the BIO scheme, but it does not work well for words containing punctuation. For example, S.E.C is joined by the function below as S . E . C.
import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # ... just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])

    # Sort and deduplicate
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))
    return collapsed_result
Another method:
I also tried detokenizing with TreebankWordDetokenizer, but it still did not reconstruct the original sentence. For example:
Original sentence -> parties. \n \n IN WITNESS WHEREOF, the parties hereto
Tokenized and detokenized sentence -> parties . IN WITNESS WHEREOF, the parties hereto
Another example:
Original sentence -> Group’s employment, Group shall be
Tokenized and detokenized sentence -> Group ’ s employment, Group shall be
Note that the period and the newlines are stripped by the TreebankWordDetokenizer.
Is there any workaround to form compound words?
Upvotes: 2
Views: 1253
Reputation: 13626
A really small fix should do the job:
def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # punctuation
            else:
                res += ' ' + token  # regular word
    return res
import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # ... just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

    # Sort and deduplicate
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))
    return collapsed_result
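For a quick sanity check, here is the joining heuristic restated as a compact, self-contained sketch, together with a case it handles and one it does not (the comma example is hypothetical, not from the question, and illustrates the kind of outlier mentioned below):

```python
def join_tokens(tokens):
    # Same heuristic as above, restated compactly:
    # insert a space only between two alphabetic neighbours.
    res = ''
    for token in tokens:
        if res and token.isalpha() and res[-1].isalpha():
            res += ' '
        res += token
    return res

print(join_tokens(["S", ".", "E", ".", "C"]))   # S.E.C
print(join_tokens(["Exchange", "Commission"]))  # Exchange Commission
print(join_tokens(["Smith", ",", "Jones"]))     # Smith,Jones  (no space restored after the comma)
```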
This will solve most of the cases, but as can be seen in the comments below, there will always be outliers. So the complete solution is to track the identity of the word that produced each token. Thus:
text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U", 0), (".", 0), ("S", 0), (".", 0), ("Securities", 1), ("and", 2), ("Exchange", 3), ("Commission", 4)]
Now, given a token's index you know exactly which word it came from, and you can simply concatenate tokens that belong to the same word, adding a space when a token belongs to a different word. So the NER result would look like:
[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
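A self-contained sketch of that lookup-table idea, using a simple regex stand-in for `tokenize` (the real tokenizer would be whatever produced the NER tokens, so the function below is an assumption for illustration):

```python
import re

def tokenize(word):
    # Stand-in tokenizer: splits off punctuation the way a
    # Treebank-style tokenizer might.
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]", word)

text = "U.S. Securities and Exchange Commission"

# Build the lookup table mapping each token to the index of its source word.
lut = [(token, ix)
       for ix, word in enumerate(text.split())
       for token in tokenize(word)]
# e.g. [('U', 0), ('.', 0), ('S', 0), ('.', 0), ('Securities', 1), ...]

def join_by_word(tagged):
    # tagged: [token, tag, word_index] triples. Concatenate tokens from the
    # same source word; insert a space only when the word index changes.
    parts = []
    prev_ix = None
    for token, _tag, ix in tagged:
        if prev_ix is not None and ix != prev_ix:
            parts.append(' ')
        parts.append(token)
        prev_ix = ix
    return ''.join(parts)

# Simulated NER output carrying the word index as a third element.
ner = [[tok, 'B-ORG' if i == 0 else 'I-ORG', ix]
       for i, (tok, ix) in enumerate(lut)]
print(join_by_word(ner))  # U.S. Securities and Exchange Commission
```

Because the reconstruction uses the word indices rather than guessing from punctuation, it recovers the original spacing exactly, with no `isalpha`-style outliers.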
Upvotes: 1