
Reputation: 2271

Avoiding <sos> and <eos> being parsed by Spacy

I am stuck on a basic thing, and I could not figure out how to make it work. My apologies if it is something super basic; I am very new to spaCy and do not know how to do this. I could not find any resource on the internet either.

I have a bunch of sentences like so

a = "<sos> Hello There! <eos>"

I am using the following lines of code to tokenize it with spaCy

import spacy
nlp = spacy.load('en_core_web_sm')
for token in nlp(a):
    print(token.text)

What it prints is something like this

<
sos
>
Hello
There
!
<
eos
>

As you can see, it split the <sos> and <eos> metatags into separate tokens. How can I avoid that? The output I would like to see is the following

<sos>
Hello
There
!
<eos>

I could not figure out how to achieve this. Any help will be great.

Thanks in advance

Upvotes: 1

Views: 722

Answers (1)

Gizio

Reputation: 66

In spaCy, the tokenizer checks for exceptions before splitting text, so you need to add a special case to the tokenizer so that your symbols are treated as single tokens.

Your code should look like this:

import spacy
from spacy.attrs import ORTH

sent = "<sos> Hello There! <eos>"

nlp = spacy.load('en_core_web_sm')

nlp.tokenizer.add_special_case('<sos>', [{ORTH: "<sos>"}])
nlp.tokenizer.add_special_case('<eos>', [{ORTH: "<eos>"}])

for token in nlp(sent):
    print(token.text)

You can read more about it here: https://spacy.io/api/tokenizer#add_special_case
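As a quick sanity check, here is a minimal, self-contained sketch of the same idea. It uses a blank English pipeline (an assumption for speed; the original answer loads `en_core_web_sm`, but the special-case mechanism is identical) and registers both metatags in a loop:

```python
import spacy
from spacy.attrs import ORTH

# A blank English pipeline is enough here; special cases live on the
# tokenizer, not on any trained components.
nlp = spacy.blank("en")

# Register each metatag as a single-token special case. Special cases are
# matched against whitespace-separated substrings before the prefix/suffix
# rules run, so "<sos>" is never split into "<", "sos", ">".
for tag in ("<sos>", "<eos>"):
    nlp.tokenizer.add_special_case(tag, [{ORTH: tag}])

tokens = [token.text for token in nlp("<sos> Hello There! <eos>")]
print(tokens)  # ['<sos>', 'Hello', 'There', '!', '<eos>']
```

Note that the special case only applies to the exact string you register, so a tag glued to other text (e.g. `"<sos>Hello"`) would still go through the normal splitting rules.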

Upvotes: 5
