Reputation: 9869
I am trying to replace a word without destroying the space structure in the sentence. Suppose I have the sentence text = "Hi this is my dog." and I wish to replace "dog" with "Simba". Following the answer from https://stackoverflow.com/a/57206316/2530674 I did:
import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc
doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba .
Notice how there is an extra space before the full stop (it ought to be "Hi this is my Simba."). Is there a way to remove this behaviour? I am happy for a general Python string-processing answer too.
Upvotes: 4
Views: 8694
Reputation: 1530
text = 'Hello This is my dog'
print(text.replace('dog', 'Simba'))
Upvotes: 1
Reputation: 101
I had a similar issue. I was trying to replace original tokens in the document with the lemma form. Originally, I used the below to make the changes:
def lemma_conversion(sent):
    carrier_str = str()
    for token in sent:
        carrier_str = carrier_str + token.lemma_ + ' '
    return carrier_str
where "sent" is an individual sentence (as a spaCy object) from the whole document. This worked except it introduced unwanted whitespace around punctuation. So instead, I decided to use the string.replace() method so that I could preserve the spacing. But, in my text I had multiple words per sentence that needed to be replaced. I could have used a regular expression to replace the word using word boundaries with optional punctuation but I wanted to be sure that I didn't have any weird exceptions. So instead, I made the replacements using string slicing to be sure I was replacing the exact word I was interested in. But, the lemmas are often shorter than form in the original text. To compensate for that I used a position offset counter to keep the alignment between the string form of the text versus the sentence as a spaCy object:
# this function replaces the original form of the word in the original sentence with
# the lemma form. This preserves the spacing with regard to punctuation.
def nice_lemma_sent(input_sent):
    j = 0
    lemma_sent = input_sent.text
    offset_counter = 0
    for token in input_sent:
        j += 1
        # the .idx value for the characters in the extracted sentences is based on the whole
        # document. This first if statement determines the .idx for the first token in each
        # sentence. This is used for adjusting the offset when doing the replacement of the
        # original word with the lemma.
        if j == 1:
            first_character_position = token.idx
        # this identifies those tokens where the lemma is different. It then gets the values
        # for the word's length and position so that slicing operations will cut it out
        # and replace it with the lemma.
        if token.text != token.lemma_:
            start_of_word = token.idx + offset_counter - first_character_position
            len_word = len(token.text)
            end_of_word = start_of_word + len_word
            len_lemma = len(token.lemma_)
            # substitution of the first word in the sentence if the lemma form is
            # different from the original form
            if token.idx == first_character_position:
                residual_sent_start_position = len_word
                lemma_sent = token.lemma_ + lemma_sent[residual_sent_start_position:]
            # substitution of subsequent words in the sentence if they are different
            # from the original form
            else:
                front_sent_end = start_of_word
                residual_sent_start = end_of_word
                lemma_sent = (lemma_sent[0:front_sent_end] + token.lemma_ +
                              lemma_sent[residual_sent_start:])
            offset_counter = len_lemma - len_word + offset_counter
    return lemma_sent
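For completeness, a minimal usage sketch (my addition, not part of the original answer; it assumes a loaded pipeline with sentence boundaries, such as en_core_web_lg):

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("The dogs were running faster.")
for sent in doc.sents:
    print(nice_lemma_sent(sent))
# expected output, roughly: "the dog be run fast."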
Upvotes: 1
Reputation: 51
spaCy tokens have some attributes that could help you. First there's token.text_with_ws, which gives you the token's text with its original trailing whitespace if it had any. Second, token.whitespace_, which just returns the trailing whitespace on the token (an empty string if there was no whitespace). If you don't need the large language model for other things you're doing, you could just use spaCy's tokenizer.
from spacy.lang.en import English

nlp = English()  # you probably don't need to load the whole language model for this
tokenizer = nlp.tokenizer
tokens = tokenizer("Hi this is my dog.")

modified = ""
for token in tokens:
    if token.text != "dog":
        modified += token.text_with_ws
    else:
        modified += "Simba"
        modified += token.whitespace_
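With the example sentence, modified ends up as "Hi this is my Simba." with no stray space, because whitespace_ is empty for "dog" (the full stop follows it directly).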
Upvotes: 5
Reputation: 11
You can specify where you want to add spaces via the spaces argument of Doc:
import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc
doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
spaces = [True]*len(doc1)
spaces[-2:] = [False, False]
Doc(doc1.vocab, words=new_words, spaces=spaces)
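If you'd rather not hardcode which positions lose their space, a small variation (my sketch, not part of the original answer) derives the flags from the original tokens:

# derive each token's trailing-space flag from the original doc
spaces = [bool(token.whitespace_) for token in doc1]
doc2 = Doc(doc1.vocab, words=new_words, spaces=spaces)
print(doc2.text)  # Hi this is my Simba.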
Upvotes: 1
Reputation: 81
Here is how I do it with regex:
sentence = 'Hi this is my dog. dogdog this is mydog'
replacement = 'Simba'
to_replace = 'dog'
import re

# capture the full run of surrounding non-word characters so it is
# preserved intact around the replacement
st = re.sub(rf'(\W+|^)({to_replace})(\W+|$)', rf'\g<1>{replacement}\g<3>', sentence)
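For comparison, a simpler sketch (my addition) using \b word boundaries covers the same standalone-word case:

import re

sentence = 'Hi this is my dog. dogdog this is mydog'
print(re.sub(r'\bdog\b', 'Simba', sentence))
# Hi this is my Simba. dogdog this is mydog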
Upvotes: 1
Reputation: 61
The below function replaces any number of matches (found with spaCy), keeps the same whitespace as the original text, and appropriately handles edge cases (like when the match is at the beginning of the text):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
matcher.add("dog", None, [{"LOWER": "dog"}])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:
            # we've skipped over some tokens, so add those in (with trailing whitespace if available)
            text += tok[buffer_start:match_start].text + tok[match_start - 1].whitespace_
        # replace the matched token, keeping its trailing whitespace if available
        text += replacement + tok[match_start].whitespace_
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text
>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.
>>> replace_word("Hi this dog is my dog.", "Simba")
Hi this Simba is my Simba.
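(Note: matcher.add("dog", None, [{"LOWER": "dog"}]) is the spaCy 2.x signature; on spaCy 3.x the equivalent is matcher.add("dog", [[{"LOWER": "dog"}]]).)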
Upvotes: 6
Reputation: 9869
Thanks to @lora-johns I found this answer. So without going down the matcher route, I think this might be a simpler answer:
# find (start index, length) pairs for every case-insensitive match
new_words = [(token.idx, len("dog")) for token in doc1 if token.text.lower() == "dog"]
# replace from the end to the start, so the earlier character offsets stay valid
new_words = sorted(new_words, key=lambda x: -x[0])
for i, l in new_words:
    text = text[:i] + "Simba" + text[i+l:]
Upvotes: 1
Reputation: 808
One way to do this in an extensible way would be to use the spaCy Matcher and to modify the Doc object, like so:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
matcher.add("dog", None, [{"LOWER": "dog"}])

def replace_word(doc, replacement):
    doc = nlp(doc)
    match_id, start, end = matcher(doc)[0]  # assuming only one match replacement
    return nlp.make_doc(doc[:start].text + f" {replacement}" + doc[end:].text)
>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.
You could of course expand this pattern and replace all instances of "dog" by adding a for-loop in the function instead of just replacing the first match, and you could swap out rules in the matcher to change different words.
The nice thing about doing it this way, even though it's more complex, is that it lets you keep the other information in the spacy Doc object, like the lemmas, parts of speech, entities, dependency parse, etc.
But if you just have a string, you don't need to worry about all that. To do this with plain Python, I'd use regex.
import re

def replace_word_re(text, word, replacement):
    return re.sub(word, replacement, text)
>>> replace_word_re("Hi this is my dog.", "dog", "Simba")
Hi this is my Simba.
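One caveat: re.sub treats word as a regex pattern, so if the search word might contain metacharacters, re.escape(word) (optionally with \b boundaries, e.g. rf'\b{re.escape(word)}\b') keeps the match literal and whole-word.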
Upvotes: 2
Reputation: 2609
So it seems like you are looking for a regular replace? I would just do:
string = "Hi this is my dog."
string = string.replace("dog","Simba")
Upvotes: 1