Reputation: 2271
I am stuck on a basic thing and could not figure out how to make it work. My apologies if it is something super basic; I am very new to spaCy, and I could not find any resource on the internet either.
I have a bunch of sentences like so
a = "<sos> Hello There! <eos>"
I am using the following lines of code to tokenize it with spaCy:
import spacy
nlp = spacy.load('en_core_web_sm')
for token in nlp(a):
    print(token.text)
What it prints is something like this:
<
sos
>
Hello
There
!
<
eos
>
As you can see, it split the <sos> and <eos> metatags into separate tokens. How can I avoid that? The output I would like to see is the following:
<sos>
Hello
There
!
<eos>
I could not figure out how to achieve this. Any help would be great.
Thanks in advance.
Upvotes: 1
Views: 722
Reputation: 66
In spaCy, the tokenizer checks for special-case exceptions before splitting text. You need to add an exception for each of your symbols so the tokenizer treats them as single, unsplittable tokens.
Your code should look like this:
import spacy
from spacy.attrs import ORTH

sent = "<sos> Hello There! <eos>"
nlp = spacy.load('en_core_web_sm')

# Register each tag as a special case so the tokenizer never splits it
nlp.tokenizer.add_special_case('<sos>', [{ORTH: "<sos>"}])
nlp.tokenizer.add_special_case('<eos>', [{ORTH: "<eos>"}])

for token in nlp(sent):
    print(token.text)
You can read more about it here: https://spacy.io/api/tokenizer#add_special_case
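If you have many such tags and don't want to register each one individually, another option is the tokenizer's `token_match` hook, which is checked before the prefix/suffix splitting rules. A minimal sketch, assuming all of your tags have the form `<lowercase-letters>` (the regex here is an assumption, not part of the original question), using a blank English pipeline so no model download is needed:

```python
import re
import spacy

# A blank pipeline uses the same English tokenization rules,
# so "<sos>" would normally be split into "<", "sos", ">".
nlp = spacy.blank("en")

# token_match is tried on each whitespace-delimited substring before
# prefix/suffix splitting; if it matches, the substring stays whole.
# Assumed tag shape: "<" + lowercase letters + ">".
tag_pattern = re.compile(r"^<[a-z]+>$")
nlp.tokenizer.token_match = tag_pattern.match

doc = nlp("<sos> Hello There! <eos>")
print([token.text for token in doc])
```

Note that if your pipeline relies on the default `token_match` (e.g. for URL-like strings in older spaCy versions), overwriting it this way replaces that behavior, so the per-tag `add_special_case` approach above is the safer choice when you only have a couple of tags.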
Upvotes: 5