sachinruk

Reputation: 9869

Spacy replace token

I am trying to replace a word without destroying the spacing in the sentence. Suppose I have the sentence text = "Hi this is my dog." and I wish to replace "dog" with "Simba". Following the answer from https://stackoverflow.com/a/57206316/2530674 I did:

import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc

doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba . 

Notice how there is an extra space before the full stop (it ought to be "Hi this is my Simba."). Is there a way to remove this behaviour? I'm happy with a general Python string-processing answer too.

Upvotes: 4

Views: 8694

Answers (9)

R. Baraiya

Reputation: 1530

text = 'Hello This is my dog'
print(text.replace('dog','simba'))
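Worth noting: a bare str.replace also rewrites substrings, so words like "dogdog" or "mydog" would get mangled too. If that matters, a word-boundary regex is safer; a minimal sketch:

```python
import re

text = "Hi this is my dog. dogdog this is mydog"
# \b restricts the match to whole words; re.escape guards against regex metacharacters
result = re.sub(rf"\b{re.escape('dog')}\b", "Simba", text)
print(result)  # Hi this is my Simba. dogdog this is mydog
```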

Upvotes: 1

Robb Dunlap

Reputation: 101

I had a similar issue: I was trying to replace the original tokens in the document with their lemma forms. Originally, I used the function below to make the changes:

def lemma_conversion(sent):
    carrier_str = ""
    for token in sent:
        carrier_str += token.lemma_ + " "
    return carrier_str

where "sent" is an individual sentence (as a spaCy object) from the whole document. This worked, except it introduced unwanted whitespace around punctuation.

So instead, I decided to use the string.replace() method so that I could preserve the spacing. But my text had multiple words per sentence that needed to be replaced. I could have used a regular expression with word boundaries and optional punctuation, but I wanted to be sure I wouldn't hit any weird exceptions. So I made the replacements with string slicing, to be certain I was replacing the exact word I was interested in.

The lemmas, however, are often shorter than the forms in the original text. To compensate, I used a position offset counter to keep the string form of the text aligned with the sentence as a spaCy object:

# this function replaces the original form of the word in the original sentence with
# the lemma form. This preserves the spacing with regard to punctuation.

def nice_lemma_sent(input_sent):
    j = 0
    lemma_sent = input_sent.text
    offset_counter = 0
    for token in input_sent:
        j += 1
        # the .idx value for characters in the extracted sentences is based on the whole
        # document. This first if statement captures the .idx of the first token in each
        # sentence; it is used to adjust the offset when replacing the original
        # word with the lemma

        if j == 1:
            first_character_position = token.idx

        # this identifies tokens whose lemma differs from the original. it then gets the
        # word's length and position so that slicing operations can cut it out
        # and replace it with the lemma
        if token.text != token.lemma_:
            start_of_word = token.idx + offset_counter - first_character_position
            len_word = len(token.text)
            end_of_word = start_of_word + len_word
            len_lemma = len(token.lemma_)

            
            # substitution of the first word in the sentence if the lemma form is 
            # different from the original form
            if token.idx == first_character_position:
                residual_sent_start_position = len_word 
                lemma_sent = token.lemma_ + lemma_sent[residual_sent_start_position:]

            # substitution of subsequent words in the sentence if they are different
            # from the original form
            else:
                front_sent_end = start_of_word
                residual_sent_start = end_of_word
                lemma_sent = lemma_sent[0:front_sent_end] + token.lemma_ + \
                             lemma_sent[residual_sent_start:]

            offset_counter = len_lemma - len_word + offset_counter

    return lemma_sent
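The offset bookkeeping can be illustrated without spaCy at all. A small sketch (the helper name and the (start, old, new) triples are made up for illustration; triples are assumed to be in left-to-right document order):

```python
def replace_spans(text, replacements):
    # replacements: (start, old, new) triples in left-to-right order.
    # offset tracks how much earlier substitutions have shifted the string,
    # exactly like the offset counter in the function above.
    offset = 0
    for start, old, new in replacements:
        s = start + offset
        assert text[s:s + len(old)] == old  # the slice really is the target word
        text = text[:s] + new + text[s + len(old):]
        offset += len(new) - len(old)
    return text

print(replace_spans("The mice were running.", [(4, "mice", "mouse"), (14, "running", "run")]))
# The mouse were run.
```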

Upvotes: 1

campc704

Reputation: 51

spaCy tokens have some attributes that can help you. First there's token.text_with_ws, which gives you the token's text with its original trailing whitespace, if it had any. Second, token.whitespace_, which returns just the token's trailing whitespace (an empty string if there was none). If you don't need the large language model for anything else you're doing, you can use spaCy's tokenizer on its own.

from spacy.lang.en import English
nlp = English() # you probably don't need to load whole lang model for this
tokenizer = nlp.tokenizer
tokens = tokenizer("Hi this is my dog.")

modified = ""
for token in tokens:
    if token.text != "dog":
        modified += token.text_with_ws
    else:
        modified += "Simba"
        modified += token.whitespace_
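The loop above can also be written as a single join over the tokens; a sketch of the same idea, using only the blank English tokenizer (no model download needed):

```python
from spacy.lang.en import English

nlp = English()  # blank pipeline: the tokenizer is all we need
tokens = nlp.tokenizer("Hi this is my dog.")

# text_with_ws reproduces each token exactly as it appeared in the source,
# so substituting one token leaves all other spacing untouched
modified = "".join(
    "Simba" + t.whitespace_ if t.text == "dog" else t.text_with_ws
    for t in tokens
)
print(modified)  # Hi this is my Simba.
```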

Upvotes: 5

Sunktoteka

Reputation: 11

You can specify where you want to add spaces:

import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc

doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
spaces = [True]*len(doc1)
spaces[-2:] = [False, False]
Doc(doc1.vocab, words=new_words, spaces=spaces)

Upvotes: 1

Here is how I do it with regex (note the raw strings, so the backslashes survive):

import re

sentence = 'Hi this is my dog. dogdog this is mydog'
replacement = 'Simba'
to_replace = 'dog'
st = re.sub(rf'(\W|^)({to_replace})(\W|$)', rf'\g<1>{replacement}\g<3>', sentence)
# st == 'Hi this is my Simba. dogdog this is mydog'

Upvotes: 1

Ethan Perez

Reputation: 61

The below function replaces any number of matches (found with spaCy), keeps the same whitespacing as the original text, and appropriately handles edge cases (like when the match is at the beginning of the text):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
# spaCy v3 signature; in v2 this was matcher.add("dog", None, [{"LOWER": "dog"}])
matcher.add("dog", [[{"LOWER": "dog"}]])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.

>>> replace_word("Hi this dog is my dog.", "Simba")
Hi this Simba is my Simba.

Upvotes: 6

sachinruk

Reputation: 9869

Thanks to @lora-johns I found this approach. Without going down the matcher route, I think this might be a simpler answer:

new_words = [(token.idx, len("dog")) for token in doc1 if token.text.lower() == "dog"]
# replace from the end of the string to the start, so earlier offsets stay valid
new_words = sorted(new_words, key=lambda x: -x[0])
for i, l in new_words:
    text = text[:i] + "Simba" + text[i + l:]
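The end-to-start order matters: replacing front-to-back would shift every later token.idx by the length difference between "Simba" and "dog". A tiny plain-Python sketch (match positions hand-computed for illustration):

```python
s = "dog dog"
matches = [(0, 3), (4, 3)]  # (start index, length) of each match in the original string

# iterate back-to-front: editing the tail first leaves the earlier indices valid
for i, l in sorted(matches, key=lambda x: -x[0]):
    s = s[:i] + "Simba" + s[i + l:]

print(s)  # Simba Simba
```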

Upvotes: 1

Ray Johns

Reputation: 808

One way to do this in an extensible way would be to use the spacy Matcher and to modify the Doc object, like so:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
matcher.add("dog", [[{"LOWER": "dog"}]])  # spaCy v3 signature

def replace_word(doc, replacement):
    doc = nlp(doc)
    match_id, start, end = matcher(doc)[0]  # assuming only one match to replace

    return nlp.make_doc(doc[:start].text + f" {replacement}" + doc[end:].text)

>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.

You could of course expand this pattern and replace all instances of "dog" by adding a for-loop in the function instead of just replacing the first match, and you could swap out rules in the matcher to change different words.

The nice thing about doing it this way, even though it's more complex, is that it lets you keep the other information in the spacy Doc object, like the lemmas, parts of speech, entities, dependency parse, etc.

But if you just have a string, you don't need to worry about all that. To do this with plain Python, I'd use regex.

import re
def replace_word_re(text, word, replacement):
    # \b word boundaries keep the match to whole words only
    return re.sub(rf"\b{re.escape(word)}\b", replacement, text)

>>> replace_word_re("Hi this is my dog.", "dog", "Simba")
Hi this is my Simba.

Upvotes: 2

Jonatan Öström

Reputation: 2609

So it seems like you are looking for a plain replace? I would just do:

string = "Hi this is my dog."
string = string.replace("dog","Simba")

Upvotes: 1
