Reputation: 2253
tl;dr: How can I combine a word root and part-of-speech tags into a fully inflected word?
e.g. something like:
getText('easy', 'adjective', 'superlative') --> 'easiest'
getText('eat', 'verb', '3rd-person-singular') --> 'eats'
getText('spoon', 'noun', 'plural') --> 'spoons'
getText('swim', 'verb', 'past-participle') --> 'swum'
etc
spaCy can tokenize/parse a sentence into tokens that each carry a "TEXT", a "LEMMA", a part-of-speech tag ("POS"), a detailed part-of-speech tag ("TAG"), etc.:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
TEXT     LEMMA  POS    TAG  DEP       SHAPE  ALPHA  STOP
Apple    apple  PROPN  NNP  nsubj     Xxxxx  True   False
is       be     VERB   VBZ  aux       xx     True   True
looking  look   VERB   VBG  ROOT      xxxx   True   False
at       at     ADP    IN   prep      xx     True   True
buying   buy    VERB   VBG  pcomp     xxxx   True   False
U.K.     u.k.   PROPN  NNP  compound  X.X.   False  False
...
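(For reference, that table comes from printing each token's attributes, as in the spaCy 101 docs:)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)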
I would like to reverse this process -- to get a "TEXT" string given a specific "LEMMA"/"POS"/"TAG" combo.
That is, something like
getText(lemma="look", pos="verb", tag="vbg")
would return "looking".
Is this possible to do in spaCy, and if so, how?
If not, is it possible to untokenize words from roots/lemmas and part-of-speech tags with a different library?
I know that pattern.en can pluralize/conjugate/etc. ("untokenize"?) words, but it would be nice to use spaCy's faster processing speed and Python 3 compatibility.
Another reason for not wanting to use pattern.en: I want to tokenize and then untokenize text later, and it would be nice to use the same library for both. I've found spaCy to be much better at tokenizing than pattern.en (e.g. pattern.en doesn't tokenize "easiest" into "easy", but spaCy does).
By "tokenize" I mean splitting a sentence into word roots and part-of-speech tags.
Upvotes: 3
Views: 1360
Reputation: 712
As far as I know, spaCy doesn't currently have that functionality built in. However, it is fairly easy to set up custom token attributes that do something similar to what you're asking. For example, to define a past-tense-conjugation attribute for all verb tokens, you can write a vbd function and register it as a getter on each token, as follows:
>>> import spacy
>>> nlp = spacy.load('en')
>>> def vbd(token):
...     """a bad conjugation function"""
...     if token.pos_ == 'VERB':
...         return token.lemma_ + 'ed'
...
>>> spacy.tokens.Token.set_extension('vbd', getter=vbd)
>>> doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
>>> for token in doc:
...     print(token.text, ":", token._.vbd)
...
Apple : None
is : beed
looking : looked
at : None
buying : buyed
U.K. : None
startup : None
for : None
$ : None
1 : None
billion : None
As you can see, the function isn't very robust: it spits out "beed" and "buyed", though "looked" is correct.
As for a robust way to do the conjugation, pattern is the best library I've encountered. If you replace the vbd function with a correct conjugation function, and define functions for whatever other conjugations or inflections you want, you'd be pretty close to what you're imagining. That would let you use pattern only for the conjugation, while tokenizing and lemmatizing with spaCy.
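For example, here's a rough sketch of that hybrid (a sketch, not a drop-in solution: pattern's 'p' alias means simple past per its docs, and force=True just overwrites the toy getter registered above):
>>> from pattern.en import conjugate
>>> def vbd(token):
...     """past-tense conjugation via pattern.en"""
...     if token.pos_ == 'VERB':
...         return conjugate(token.lemma_, 'p')  # 'p' = simple past alias
...
>>> spacy.tokens.Token.set_extension('vbd', getter=vbd, force=True)
>>> for token in nlp(u'Apple is looking at buying U.K. startup for $1 billion'):
...     print(token.text, ":", token._.vbd)  # 'buying' should now yield 'bought', not 'buyed'
...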
Upvotes: 2