stuart

Reputation: 2253

Get fully formed word "text" from word root (lemma) and part-of-speech (POS) tags in spaCy

tl;dr How can I combine a word root and part-of-speech tags into a fully modified word?

e.g. something like:

getText('easy', 'adjective', 'superlative') --> 'easiest'

getText('eat', 'verb', '3rd-person-singular') --> 'eats'

getText('spoon', 'noun', 'plural') --> 'spoons'

getText('swim', 'verb', 'past-participle') --> 'swum'

etc

spaCy can tokenize/parse this sentence into the following tokens containing "TEXT", "LEMMA", part-of-speech tag ("POS"), detailed part-of-speech tag ("TAG"), etc.:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

parsed tokens:

TEXT    LEMMA   POS     TAG DEP         SHAPE   ALPHA   STOP
Apple   apple   PROPN   NNP nsubj       Xxxxx   True    False
is      be      VERB    VBZ aux         xx      True    True
looking look    VERB    VBG ROOT        xxxx    True    False
at      at      ADP     IN  prep        xx      True    True
buying  buy     VERB    VBG pcomp       xxxx    True    False
U.K.    u.k.    PROPN   NNP compound    X.X.    False   False
...
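
(For reference, the table above is just a printout of spaCy's token attributes:)

import spacy

nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

# Print the columns shown above from each token's attributes.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)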

I would like to reverse this process -- to get a "TEXT" string given a specific "LEMMA"/"POS"/"TAG" combo.

That is, where something like

getText(lemma="look", pos="verb", tag="vbg")

would return "looking".

Is this possible to do in spaCy, and if so, how?

If not, is it possible to untokenize words from roots/lemmas and part-of-speech tags with a different library?

I know that pattern.en can pluralize/conjugate/etc. ("untokenize"?) words, but it would be nice to use spaCy's faster processing speed and Python 3 compatibility.

Another reason for not wanting to use pattern.en: I want to tokenize and then untokenize text later, and it would be nice to use the same library for both. I've found spaCy to be much better at tokenizing than pattern.en (e.g. pattern.en doesn't tokenize "easiest" into "easy", but spaCy does).

By "tokenize" I mean splitting a sentence into word roots and part-of-speech tags.

Upvotes: 3

Views: 1360

Answers (1)

pmbaumgartner

Reputation: 712

As far as I know, spaCy doesn't currently have that functionality built in. However, it would be fairly easy to set up custom token attributes that do something similar to what you're asking. For example, if you wanted to define a past-tense conjugation attribute for all the verb tokens, you could write a vbd function and register it as a getter for a custom attribute on each token, as follows:

>>> import spacy
>>> nlp = spacy.load('en')

>>> def vbd(token):
...     """a bad conjugation function"""
...     if token.pos_ == 'VERB':
...         return token.lemma_ + 'ed'

>>> spacy.tokens.Token.set_extension('vbd', getter=vbd)  # a getter can't be combined with a default
>>> doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
>>> for token in doc:
...     print(token.text, ":", token._.vbd)

Apple : None
is : beed
looking : looked
at : None
buying : buyed
U.K. : None
startup : None
for : None
$ : None
1 : None
billion : None

As you can see, the function isn't very robust as it spits out "beed" and "buyed", but "looked" is correct.

As for a robust way to do the conjugation, pattern is the best library I've encountered. If you replace the vbd function with a correct conjugation function (as in the sketch below) and define functions for whatever other conjugations or inflections you want, you'd be pretty close to what you're imagining. This would allow you to use pattern only for the conjugation, while tokenizing and lemmatizing with spaCy.
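
For example, a sketch of that combination (conjugate and its 'p' past-tense alias are pattern.en's real API, but the coverage here is minimal and untested):

import spacy
from pattern.en import conjugate

def vbd(token):
    """Past-tense conjugation via pattern.en instead of the naive 'ed' rule."""
    if token.pos_ == 'VERB':
        return conjugate(token.lemma_, 'p')  # 'p' = past tense, e.g. 'buy' -> 'bought'

# force=True re-registers the attribute if it was already set above
spacy.tokens.Token.set_extension('vbd', getter=vbd, force=True)

nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text, ":", token._.vbd)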

Upvotes: 2
