Reputation: 83387
I use the benepar parser to parse sentences into trees. How can I prevent the benepar parser from splitting a specific substring when parsing a string?
E.g., the token "gonna" is split by benepar into two tokens, "gon" and "na", which I don't want.
Code example, with pre-requisites:
pip install spacy benepar
python -m nltk.downloader punkt benepar_en3
python -m spacy download en_core_web_md
If I run:
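import benepar, spacy
import nltk
benepar.download('benepar_en3')  # fetch the parser model if it is not already present
nlp = spacy.load('en_core_web_md')
# benepar is added to the pipeline differently in spaCy 2 and spaCy 3+
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("This is gonna be fun.")
sent = list(doc.sents)[0]
print(sent._.parse_string)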
It'll output:
(S (NP (DT This)) (VP (VBZ is) (VP (TO gon) (VP (TO na) (VP (VB be) (NP (NN fun)))))) (. .))
The issue is that the token "gonna" is split into two tokens, "gon" and "na". How can I prevent that?
Upvotes: 1
Views: 648
Reputation: 83387
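The split comes from spaCy's tokenizer, which runs before benepar and has a built-in exception that breaks "gonna" into "gon" and "na". Use nlp.tokenizer.add_special_case to register "gonna" as a single token before parsing: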
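import benepar, spacy
import nltk
from spacy.symbols import ORTH
benepar.download('benepar_en3')  # fetch the parser model if it is not already present
nlp = spacy.load('en_core_web_md')
# Override the tokenizer rule so that 'gonna' stays a single token
nlp.tokenizer.add_special_case('gonna', [{ORTH: 'gonna'}])
# benepar is added to the pipeline differently in spaCy 2 and spaCy 3+
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("This is gonna be fun.")
sent = list(doc.sents)[0]
print(sent._.parse_string)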
This is the output for the above code:
(S (NP (DT This)) (VP (VBZ is) (VP (TO gonna) (VP (VB be) (NP (NN fun))))) (. .))
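To check the tokenization on its own, without loading benepar or a trained model, something like the following sketch may help. It assumes spaCy 3 and a blank English pipeline (spacy.blank("en")), which should carry the same default tokenizer rules; keep_whole is a hypothetical helper name used for illustration, not part of spaCy or benepar.

```python
import spacy
from spacy.symbols import ORTH


def keep_whole(nlp, words):
    """Hypothetical helper: register each word as a single-token special case."""
    for word in words:
        nlp.tokenizer.add_special_case(word, [{ORTH: word}])


# A blank English pipeline contains only the tokenizer (assumption: spaCy 3).
nlp = spacy.blank("en")
text = "This is gonna be fun."

# Tokens produced by the default English tokenizer rules.
before = [token.text for token in nlp(text)]

# Register the special case, then tokenize the same text again.
keep_whole(nlp, ["gonna"])
after = [token.text for token in nlp(text)]

print(before)
print(after)
```

Special cases match the exact string, so a differently cased form such as "Gonna" would need its own entry if the tokenizer splits it as well.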
Upvotes: 0