Reputation: 1418
For quite a while now, I've been looking for a decent dictionary-based text classification library in Python.
My use case is as follows: I will be receiving a long text, which will likely talk about several things and hopefully mention some of a pre-defined set of entities.
text = "Yesterday, I ate a Yelow-fruit. It was the longest fruit I ever ate."
entities = {"apple": ["pink", "sphere"], "banana": ["yellow", "tasty", "long"]}
Please note that the spelling errors are intentional!
My goal is to have a program such that, given this text and the entities dict (which will change over time), the program outputs banana. Hence, the problem can be seen as a dictionary-based text classification problem, where one classifies the text based on the entities dictionary.
The latter problem seems quite standard to me, but I have failed to find a decent implementation in Python.
Of course, I could go through the text, count word occurrences per entity, and output the most frequent entity. But this approach is very simple and won't survive a real-world scenario where the occurrences are NOT exact. I would expect a good approach to include some text similarity metric and to let the user choose which pre-processing steps are acceptable (lowercasing, stemming, stopword removal, ...). How should tokenization be done? Is there a semantic similarity measure or not? If there is, is a dictionary expansion algorithm offered?
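For illustration, here is a minimal sketch of the naive approach I have in mind, using the standard library's difflib for approximate matching (the tokenization and the 0.7 similarity cutoff are arbitrary placeholder choices):

import difflib
import re

text = "Yesterday, I ate a Yelow-fruit. It was the longest fruit I ever ate."
entities = {"apple": ["pink", "sphere"], "banana": ["yellow", "tasty", "long"]}

def classify(text, entities, cutoff=0.7):
    # Crude tokenization: lowercase, keep alphabetic runs only
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count tokens that approximately match any keyword of each entity
    scores = {
        entity: sum(
            1 for token in tokens
            if difflib.get_close_matches(token, keywords, n=1, cutoff=cutoff)
        )
        for entity, keywords in entities.items()
    }
    return max(scores, key=scores.get)

print(classify(text, entities))  # 'banana'

This catches "Yelow" vs. "yellow", but it is exactly the kind of ad-hoc code I would like to replace with a proper library.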
So far, I've read this R blog post, which gives a starting point for dictionary-based text classification in R. This Data Science Stack Exchange question seems related to mine as well. But neither gives a satisfactory answer.
So, is there any straightforward library in Python for this kind of task?
Thanks in advance for your replies.
Upvotes: 2
Views: 704
Reputation: 192
You're asking for a simple solution while at the same time asserting that the problem is complex and that the frequency of related tokens may not be a strong enough indicator to always predict the right class. What you're alluding to is that you need to understand the context in which your related tokens are mentioned, and for that you will likely need a more complex solution.
One solution you could look at is zero-shot NLP classification.
from transformers import pipeline

# Build a zero-shot classifier backed by an NLI model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate."
candidate_labels = ['banana', 'apple']

# Score how well each candidate label fits the text
classifier(text, candidate_labels)
The output is:
{'sequence': 'Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate.', 'labels': ['banana', 'apple'], 'scores': [0.6739664077758789, 0.3260335922241211]}
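As the sorted scores suggest, the returned labels are ordered from best to worst, so if you only need the winning entity you can take the first one:

result = classifier(text, candidate_labels)
print(result['labels'][0])  # 'banana'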
Edit: To incorporate further information into the model, we can expand the candidate labels with the descriptive words from the entities dict, like so:
candidate_labels = ['banana yellow tasty long', 'apple pink sphere']
The new output is:
{'sequence': 'Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate.', 'labels': ['banana yellow tasty long', 'apple pink sphere'], 'scores': [0.9392997622489929, 0.060700222849845886]}
You can see that the addition of the descriptive words has widened the gap between the scores of our two labels.
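Since the expanded labels no longer match the entity names from your dict, one way to tie the prediction back to an entity is a small mapping built from that dict (the label format here is an illustrative choice):

entities = {"apple": ["pink", "sphere"], "banana": ["yellow", "tasty", "long"]}
# Key: descriptive label fed to the classifier; value: the original entity name
label_map = {f"{name} {' '.join(words)}": name for name, words in entities.items()}
result = classifier(text, list(label_map))
print(label_map[result['labels'][0]])  # 'banana'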
Note: This attention-based deep learning model uses embeddings to understand the similarity between tokens. Tokens that are close together in this high-dimensional space are proportionally similar in their semantic meaning. Even without the additional descriptive words, the model should be able to generalize, because it contains knowledge about what a banana or an apple is from the millions of text examples it was trained on. This is worth keeping in mind when you expect the model to generalize to more abstract examples, e.g. where classes consist of product SKUs.
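To make the embedding idea concrete, here is a small sketch using the separate sentence-transformers package (my choice for illustration; the zero-shot pipeline above does not depend on it):

from sentence_transformers import SentenceTransformer, util

# Any small sentence-embedding model will do; this one is a common default
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the text and both label descriptions into the same vector space
embeddings = model.encode([
    "Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate.",
    "banana yellow tasty long",
    "apple pink sphere",
])

# Cosine similarity: closer vectors indicate closer semantic meaning
print(util.cos_sim(embeddings[0], embeddings[1]))  # similarity to the banana label
print(util.cos_sim(embeddings[0], embeddings[2]))  # similarity to the apple label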
Upvotes: 3