Reputation: 1418
For quite a while now, I've been looking for a decent dictionary-based text classification library in Python.
My use case is as follows: I will be receiving a long text, which will likely talk about several things and hopefully mention some of a pre-defined set of entities.
text = "Yesterday, I ate a Yelow-fruit. It was the longest fruit I ever ate."
entities = {"apple": ["pink", "sphere"], "banana": ["yellow", "tasty", "long"]}
Please note that the spelling errors are intentional!
My goal is to have a program such that, given this text and the entities dict (which will change over time), the program outputs banana. Hence, the problem can be seen as a dictionary-based text classification problem, where one classifies the text based on the entities dictionary.
The latter problem seems quite standard to me, but I have failed to find a decent implementation in Python.
Of course, I could go through the text, count word occurrences per entity, and output the most frequent entity. But this approach is very simple and won't survive a real-world scenario where the occurrences are NOT exact. I would expect a good approach to include some text similarity metric and to let the user choose which pre-processing steps are acceptable (lowercasing, stemming, stopword removal, ...). How should tokenization be done? Is there a semantic similarity measure or not? If there is, is a dictionary expansion algorithm offered?
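For illustration, here is a minimal sketch of the naive approach I have in mind, using the standard library's difflib for approximate matching (the tokenization and the 0.7 similarity cutoff are arbitrary placeholder choices):

import difflib
import re

text = "Yesterday, I ate a Yelow-fruit. It was the longest fruit I ever ate."
entities = {"apple": ["pink", "sphere"], "banana": ["yellow", "tasty", "long"]}

def classify(text, entities, cutoff=0.7):
    # Crude tokenization: lowercase, keep alphabetic runs only
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count tokens that approximately match any keyword of each entity
    scores = {
        entity: sum(
            1 for token in tokens
            if difflib.get_close_matches(token, keywords, n=1, cutoff=cutoff)
        )
        for entity, keywords in entities.items()
    }
    return max(scores, key=scores.get)

print(classify(text, entities))  # 'banana'

This catches "Yelow" vs. "yellow", but it is exactly the kind of ad-hoc code I would like to replace with a proper library.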
So far, I've read this R blog post, which gives a starting point for dictionary-based text classification in R. This Data Science Stack Exchange question seems related to mine as well. But neither gives a satisfactory answer.
So, is there any straightforward library in Python for this kind of task?
Thanks in advance for your replies.
Upvotes: 2
Views: 704
Reputation: 192
You're asking for a simple solution while at the same time asserting that the problem is complex and that the frequency of related tokens may not be a strong enough indicator to always predict the right class. What you're alluding to is that you need to understand the context in which your related tokens are mentioned, and for that you will likely need a more complex solution.
One solution you could look at is zero-shot NLP classification.
from transformers import pipeline

# Build a zero-shot classifier backed by an NLI model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate."
candidate_labels = ['banana', 'apple']

# Score how well each candidate label fits the text
classifier(text, candidate_labels)
The output is:
{'sequence': 'Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate.', 'labels': ['banana', 'apple'], 'scores': [0.6739664077758789, 0.3260335922241211]}
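As the sorted scores suggest, the returned labels are ordered from best to worst, so if you only need the winning entity you can take the first one:

result = classifier(text, candidate_labels)
print(result['labels'][0])  # 'banana'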
Edit: To incorporate further information into the model, we can expand the candidate labels with the descriptive words from the entities dict, like so:
candidate_labels = ['banana yellow tasty long', 'apple pink sphere']
The new output is:
{'sequence': 'Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate.', 'labels': ['banana yellow tasty long', 'apple pink sphere'], 'scores': [0.9392997622489929, 0.060700222849845886]}
You can see that the addition of the descriptive words has widened the gap between the scores of our two labels.
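Since the expanded labels no longer match the entity names from your dict, one way to tie the prediction back to an entity is a small mapping built from that dict (the label format here is an illustrative choice):

entities = {"apple": ["pink", "sphere"], "banana": ["yellow", "tasty", "long"]}
# Key: descriptive label fed to the classifier; value: the original entity name
label_map = {f"{name} {' '.join(words)}": name for name, words in entities.items()}
result = classifier(text, list(label_map))
print(label_map[result['labels'][0]])  # 'banana'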
Note: This attention-based deep learning model uses embeddings to understand the similarity between tokens. Tokens that are close together in this high-dimensional space are proportionally similar in their semantic meaning. Even without the additional descriptive words, the model should be able to generalize, because it contains knowledge about what a banana or an apple is from the millions of text examples it was trained on. This is worth keeping in mind when you expect the model to generalize to more abstract examples, e.g. where classes consist of product SKUs.
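To make the embedding idea concrete, here is a small sketch using the separate sentence-transformers package (my choice for illustration; the zero-shot pipeline above does not depend on it):

from sentence_transformers import SentenceTransformer, util

# Any small sentence-embedding model will do; this one is a common default
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the text and both label descriptions into the same vector space
embeddings = model.encode([
    "Yesterday, I ate a yelow-fruit. It was the longest fruit I ever ate.",
    "banana yellow tasty long",
    "apple pink sphere",
])

# Cosine similarity: closer vectors indicate closer semantic meaning
print(util.cos_sim(embeddings[0], embeddings[1]))  # similarity to the banana label
print(util.cos_sim(embeddings[0], embeddings[2]))  # similarity to the apple label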
Upvotes: 3