Will

Reputation: 381

Can the Porter Stemmer return the affix rather than the stem?

I am working on a project in which I am trying to calculate the percentage of inflectional morphology of multiple corpora in order to compare them.

I know how to use the NLTK Porter stemmer to get the root of a word, but it would be much more helpful if I could get the affix rather than the root. If I could do that, I could just count the number of affixes the stemmer cut off ("ly", "ed", etc.) and compare it to the total number of words. It might be a simple flip, but I can't figure out how to do this when all I get back is the root.

Upvotes: 1

Views: 317

Answers (2)

Roman Kishchenko

Reputation: 677

Are you sure that you are talking about inflectional morphology? Inflection means that the part of speech remains unchanged and the word is changed only to express some grammatical feature (such as past tense). In English, inflectional affixes are always suffixes and, if we don't take irregular words into account, there's a limited number of them (-ed, -ing, -er, -est, -s, -es).

However, it seems like you're talking about derivational morphology, because an English word typically carries at most one inflectional suffix, so counting them per word doesn't make much sense to me (the count is 0 if it's a lemma and 1 if it's an inflected form).

If you're talking about derivational affixes, then what you're looking for is called morpheme segmentation (or morpheme tokenization), and it's not an easy thing to do because word derivation processes are influenced by many factors and aren't well defined. In easy cases, we just append a suffix (or prepend a prefix) to the root. However, there are cases when some letters in the root are dropped (arrive -> arrival), changed (try -> tried or, more unusually, assume -> assumption) or even added (drama -> dramatist). Moreover, you need some kind of semantic knowledge base, because without it it's not possible to determine the morphemes correctly in all cases. For example, the word remember can be split into re- + member. Without semantics, such an analysis looks quite reasonable, as re- is a very common prefix meaning repetition and member is an existing word. Knowing the semantic relationship would tell us that member and remember are not related (I believe they might be related etymologically, but in the modern language the relationship is not that obvious).
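To make the remember example concrete, here is a minimal sketch of a purely form-based splitter. The prefix list, the word list and the function name `naive_prefix_split` are all made up for illustration; this is not how Morfessor or any real segmenter works, just a demonstration of why surface matching alone is not enough.

```python
from typing import Optional, Tuple

# Illustrative data only: a tiny prefix list and word list, not a real lexicon.
PREFIXES = ("re", "un", "dis")
VOCABULARY = {"member", "write", "happy", "agree"}


def naive_prefix_split(word: str) -> Optional[Tuple[str, str]]:
    """Split off a known prefix when the remainder is itself a known word."""
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest in VOCABULARY:
            return prefix, rest
    return None


# "rewrite" and "unhappy" are split sensibly, but "remember" is split too,
# because nothing here knows that "member" and "remember" are unrelated in meaning.
for word in ("rewrite", "unhappy", "remember"):
    print(word, naive_prefix_split(word))
```

The first two splits are the ones we want; the third is the false positive described above, and no amount of extra prefixes or vocabulary fixes it without semantic information.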

Check out Lingua Robot and Morfessor. The first one is an API that parses the English Wiktionary and provides the data as JSON. Affixes are available as part of this JSON. Morfessor is a tool for morphological segmentation, so it does exactly what you need.

Upvotes: 1

Jason Angel

Reputation: 2444

Well, if you want to get the affix, just removing the stem (the Porter result) from the original word form should work.

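Consider this sketch:

from nltk.stem import PorterStemmer

word = "hopeful"
stem_word = PorterStemmer().stem(word)   # stem_word should be "hope"
# Only works when the stem is a literal prefix of the word ("happy" stems to "happi")
affix = word[len(stem_word):] if word.startswith(stem_word) else None   # affix should be "ful"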
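Extending the same idea to a whole corpus, a rough sketch of the counting step might look like the following. The helper names (`extract_suffix`, `count_suffixes`) and the `toy_stem` function are hypothetical; `toy_stem` is only a stand-in so the example runs without any dependencies, and it is far cruder than the Porter algorithm. Words whose stem is not a literal prefix of the word are simply skipped here, which is an assumption you may want to revisit.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def extract_suffix(word: str, stem: str) -> Optional[str]:
    """Return the part of `word` after `stem`, or None if nothing usable was removed."""
    if stem and word != stem and word.startswith(stem):
        return word[len(stem):]
    return None


def count_suffixes(
    tokens: Iterable[str], stem_fn: Callable[[str], str]
) -> Tuple[Counter, int]:
    """Count the suffixes removed by `stem_fn` and the total number of tokens."""
    counts: Counter = Counter()
    total = 0
    for token in tokens:
        total += 1
        word = token.lower()
        suffix = extract_suffix(word, stem_fn(word))
        if suffix is not None:
            counts[suffix] += 1
    return counts, total


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one of a few suffixes."""
    for suffix in ("ing", "ful", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "She walked slowly past the hopeful dogs".split()
counts, total = count_suffixes(tokens, toy_stem)
ratio = sum(counts.values()) / total  # share of tokens that lost a suffix
print(counts, total, ratio)
```

To use the real stemmer instead, pass `PorterStemmer().stem` from `nltk.stem` as `stem_fn`. Keep in mind that the resulting ratio mixes inflectional and derivational suffixes, since the stemmer does not distinguish between them.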

Another alternative that might help is a "hyphenator", since it can potentially divide words into morphemes rather than just splitting off the root. It could therefore give you more affix information.

Upvotes: 0
