
Reputation: 2271

Avoiding <sos> and <eos> being parsed by Spacy

I am stuck on a basic thing, and I could not figure out how to make it work. My apologies if it is something super basic; I am very new to spaCy and do not know how to do this. I could not find any resource on the internet either.

I have a bunch of sentences like so

a = "<sos> Hello There! <eos>"

I am using the following lines of code to tokenize it with spaCy

import spacy
nlp = spacy.load('en_core_web_sm')
for token in nlp(a):
    print(token.text)

What it prints is something like this

<
sos
>
Hello
There
!
<
eos
>

As you can see, it split the <sos> and <eos> metatags into separate tokens. How can I avoid that? The output I would like to see is the following

<sos>
Hello
There
!
<eos>

I could not figure out how to achieve this. Any help will be great.

Thanks in advance

Upvotes: 1

Views: 722

Answers (1)

Gizio

Reputation: 66

In spaCy, the tokenizer checks for exceptions before splitting text, so you need to add a special case to the tokenizer so that your symbols are treated as single tokens.

Your code should look like this:

import spacy
from spacy.attrs import ORTH

sent = "<sos> Hello There! <eos>"

nlp = spacy.load('en_core_web_sm')

nlp.tokenizer.add_special_case('<sos>', [{ORTH: "<sos>"}])
nlp.tokenizer.add_special_case('<eos>', [{ORTH: "<eos>"}])

for token in nlp(sent):
    print(token.text)

You can read more about it here: https://spacy.io/api/tokenizer#add_special_case
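As a quick sanity check, here is a minimal, self-contained sketch of the same idea. It uses a blank English pipeline (an assumption for speed; the original answer loads `en_core_web_sm`, but the special-case mechanism is identical) and registers both metatags in a loop:

```python
import spacy
from spacy.attrs import ORTH

# A blank English pipeline is enough here; special cases live on the
# tokenizer, not on any trained components.
nlp = spacy.blank("en")

# Register each metatag as a single-token special case. Special cases are
# matched against whitespace-separated substrings before the prefix/suffix
# rules run, so "<sos>" is never split into "<", "sos", ">".
for tag in ("<sos>", "<eos>"):
    nlp.tokenizer.add_special_case(tag, [{ORTH: tag}])

tokens = [token.text for token in nlp("<sos> Hello There! <eos>")]
print(tokens)  # ['<sos>', 'Hello', 'There', '!', '<eos>']
```

Note that the special case only applies to the exact string you register, so a tag glued to other text (e.g. `"<sos>Hello"`) would still go through the normal splitting rules.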

Upvotes: 5
