Sai Prasanna

Reputation: 694

How to detokenize spacy text without doc context?

I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenizer, on both the encoder and decoder sides.

The seq2seq model's output is a stream of such tokens, and I want to detokenize them back into natural text.

Example:

Input to Seq2Seq: Some text

Output from Seq2Seq: This does n't work .

Is there any API in spaCy to reverse the tokenization performed by its rule-based tokenizer?

Upvotes: 3

Views: 4459

Answers (2)

Yurkee

Reputation: 873

TL;DR I've written code that attempts to do this; the snippet is below.


Another approach, with a computational complexity of O(n^2), would be to use the class I just wrote. The main thought was: "What spaCy splits, shall be rejoined once more!"

Code:

#!/usr/bin/env python
import string

import spacy


class detokenizer:
    """ This class is an attempt to detokenize a spaCy-tokenized sentence """

    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens: list):
        """ Call this method to get the list of detokenized words """
        # Keep merging adjacent tokens until no joinable pair remains.
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens: list) -> str:
        """ Call this method to get the detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens: list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of the lengths of the two spaCy-tokenized parts
        equals the number of tokens spaCy produces for the joined string.

        In other words, we only join when the join is reversible, e.g. for
        the tokens ["The", "man", "."] we would join "man" with "." but
        would not join "The" with "man.".
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        # Never glue the following token onto trailing punctuation.
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join

Usage:

import spacy

dt = detokenizer()

sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)

string_tokens = [a.text for a in spaCy_tokenized]

detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)

print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)

Output:

I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']

Downsides:

With this approach you may easily merge "do" and "nt", as well as strip the space between a dot "." and the preceding word. The method is not perfect, because multiple different source sentences can lead to the same spaCy tokenization.

I am not sure there is a method to fully detokenize a sentence when all you have is the spaCy-separated text, but this is the best I've got.
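
To illustrate the ambiguity (a quick sketch against the same model; the tokenization of "dont" matches the output shown above), two different source strings can produce the identical token sequence, so the reverse mapping is not unique:

import spacy

nlp = spacy.load("en_core_web_sm")

# Both "dont" and "do nt" come out as the same token list,
# so a detokenizer cannot tell which one the original text contained.
print([t.text for t in nlp("dont")])   # ['do', 'nt']
print([t.text for t in nlp("do nt")])  # ['do', 'nt']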


After hours of searching on Google, only a few answers came up, with this very Stack Overflow question open in 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers had chosen, I had to develop my own solution. Hope it helps someone ;)

Upvotes: 4

syllogism_

Reputation: 4297

Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
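
For illustration, a minimal sketch of how that array lets you rebuild the original string from a spaCy Doc, using the documented token.text and token.whitespace_ attributes (the example sentence is mine):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This doesn't work.")

# Each token stores the whitespace that followed it in the original text,
# so concatenating text + whitespace_ round-trips the input exactly.
rebuilt = "".join(token.text + token.whitespace_ for token in doc)
assert rebuilt == doc.text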

James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.

Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.

Here's the code to find these bits for a spaCy Doc:

def has_pre_space(token):
    # The first token in the Doc can never have a preceding space.
    if token.i == 0:
        return False
    # Otherwise, check whether the previous token carries trailing whitespace.
    if token.nbor(-1).whitespace_:
        return True
    else:
        return False

def has_space(token):
    # token.whitespace_ is " " if the token is followed by a space, else "".
    return token.whitespace_

The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.

James's point is that this strategy adds very little entropy to the word-prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 1), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
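
If I read this right, the joining rule could be sketched roughly like this (illustrative only, not revtok's actual API; the triples follow the (lexeme, pre-space affinity, post-space affinity) encoding above):

def join_with_affinities(triples):
    """triples: list of (text, pre_affinity, post_affinity).
    A space is emitted only when the previous token's post affinity
    and the current token's pre affinity are both set."""
    out = []
    for i, (text, pre, post) in enumerate(triples):
        if i > 0 and triples[i - 1][2] and pre:
            out.append(" ")
        out.append(text)
    return "".join(out)

# "Blame" the period for the missing space: (hello, 1, 1), (., 0, 1)
print(join_with_affinities([("hello", 1, 1), (".", 0, 1)]))  # -> "hello."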

Upvotes: 10
