Reputation: 807
If I process the sentence
'Return target card to your hand'
with spaCy and the en_core_web_lg model, it tags the tokens as follows:
Return NOUN target NOUN card NOUN to ADP your ADJ hand NOUN
How can I force 'Return' to be tagged as a VERB? And how can I do this before the parser runs, so that the parser can better interpret the relations between tokens?
There are other situations in which this would be useful. I am dealing with text which contains specific symbols such as {G}. These three characters should be considered a NOUN as a whole, and {T} should be a VERB. Right now I do not know how to achieve that without developing a new model for tokenizing and tagging. If I could "force" a token's tag, I could replace these symbols with something that would be recognized as one token and force it to be tagged appropriately. For example, I could replace {G} with SYMBOLG and force SYMBOLG to be tagged as a NOUN.
Upvotes: 2
Views: 2735
Reputation: 930
EDIT: this solution used spaCy 2.0.12 (IIRC).
To answer the second part of your question, you can add special tokenisation rules to the tokeniser, as stated in the docs here. The following code should do what you want, assuming those symbols are unambiguous:
import spacy
from spacy.symbols import ORTH, POS, NOUN, VERB

nlp = spacy.load('en')

# Treat each symbol as a single token with a fixed part-of-speech.
nlp.tokenizer.add_special_case('{G}', [{ORTH: '{G}', POS: NOUN}])
nlp.tokenizer.add_special_case('{T}', [{ORTH: '{T}', POS: VERB}])

doc = nlp('This {G} a noun and this is a {T}')
for token in doc:
    print('{:10}{:10}'.format(token.text, token.pos_))
The output is as follows (some of the other tags are not correct, but it shows the special case rules have been applied):
This DET
{G} NOUN
a DET
noun NOUN
and CCONJ
this DET
is VERB
a DET
{T} VERB
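Note that the snippet above targets spaCy 2.0; in spaCy 3.x, tokenizer special cases can no longer set POS, only attributes like ORTH and NORM. A sketch of the 3.x equivalent, combining a tokenizer special case (to keep the symbol as one token) with the AttributeRuler component (to set its POS) — the blank pipeline here is just for illustration, a trained model works the same way:

```python
# spaCy 3.x sketch: special cases keep '{G}'/'{T}' as single tokens,
# and the AttributeRuler assigns their part-of-speech.
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")  # illustration only; spacy.load(...) works the same

# 1) Keep each symbol as a single token.
nlp.tokenizer.add_special_case("{G}", [{ORTH: "{G}"}])
nlp.tokenizer.add_special_case("{T}", [{ORTH: "{T}"}])

# 2) Assign POS via the attribute ruler (runs as a pipeline component).
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "{G}"}]], attrs={"POS": "NOUN"})
ruler.add(patterns=[[{"ORTH": "{T}"}]], attrs={"POS": "VERB"})

doc = nlp("This {G} a noun and this is a {T}")
print([(t.text, t.pos_) for t in doc])
```

With a full model, the attribute ruler should be placed so it runs after the tagger, overriding its predictions for these tokens.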
As for the first part of your question, the problem with assigning a part-of-speech to individual words is that they are mostly ambiguous out of context (is "return" a noun or a verb?). The method above therefore would not let you account for context and is likely to introduce errors. However, spaCy does support token-based pattern matching, which is worth a look; there may be a way to do what you're after with it.
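To illustrate what token-based matching looks like, here is a minimal Matcher sketch (spaCy 3.x API; the rule name and the pattern itself are invented for this example — you would design patterns that capture the contexts where "return" is a verb):

```python
# Minimal spaCy Matcher sketch: find 'return' in a verb-like context
# by matching on surrounding tokens rather than forcing a POS tag.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # illustration only; no trained components needed
matcher = Matcher(nlp.vocab)

# Hypothetical rule: 'return' immediately followed by 'target'.
matcher.add("RETURN_VERB", [[{"LOWER": "return"}, {"LOWER": "target"}]])

doc = nlp("Return target card to your hand")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)  # → ['Return target']
```

Once a match is found, you know the span's role from its context and can act on it (e.g. label it, or feed a corrected tag downstream) without a blanket rule for the word in isolation.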
Upvotes: 6