Damian Grzanka

Reputation: 305

How to find if a topic is mentioned in a sentence? - nlp

I am pretty new to NLP, and I am looking for the most appropriate solution for my problem.

Put simply, I want to create a "tag list" from a title.

Tags are predefined, and I can easily label examples for training.

Simple examples:

Format "exemplary sentence" - "exemplary tag list"

I don't need the specific value of the tag

e.g. tags = { Animal: Elephant } is as useful as tags = [Animals]

The only solutions I could find extract the entity itself. The only approach I came up with is to build a list of matchers and try them all. Is there a cleverer, more performant way to do it?

Thanks for any suggestions, tips, and resources. Have a nice day :)

Upvotes: 2

Views: 1617

Answers (2)

Moritz

Reputation: 3225

You could build your own custom classifier (as suggested by polm23), but given that you are new to NLP, this might be too complicated and time-consuming.

An exciting newer approach is so-called "zero-shot classification". You take a general machine learning model that someone else has pre-trained for text classification and apply it to your specific use case without having to train or fine-tune it yourself. The HuggingFace Transformers library has a very easy-to-use implementation of this. Here is an interactive web application to see what it does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.

Concretely applied to your use-case, this would look something like this:

# Run in a terminal first: pip install transformers==3.1.0
from transformers import pipeline

# Loads the pipeline's default pre-trained model (downloaded on first use)
classifier = pipeline("zero-shot-classification")

sequence = "The biggest elephant in the world"
candidate_labels = ["animals", "fruits", "diseases"]

# Labels are returned sorted by score, highest first
result = classifier(sequence, candidate_labels)
print(result)

# output: {'sequence': 'The biggest elephant in the world', 
# 'labels': ['animals', 'diseases', 'fruits'], 
# 'scores': [0.9948041439056396, 0.0035726651549339294, 0.0016232384368777275]}

If you want the algorithm to assign more than one label to a text, you can enable multi-label classification. Each label is then scored independently, so several labels can get a high score for the same text.

sequence = "I like mangos and gorillas"
candidate_labels = ["animals", "fruits", "diseases"]

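# multi_class=True is the argument name in transformers 3.1.0;
# newer releases rename it to multi_label
classifier(sequence, candidate_labels, multi_class=True)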

# output: {'sequence': 'I like mangos and gorillas', 
# 'labels': ['animals', 'fruits', 'diseases'], 
# 'scores': [0.9978452920913696, 0.989518404006958, 0.00015786082076374441]}

=> In your words: it "creates a 'tag list'" for each text. For each of the predefined tags, it returns a confidence score, and you can then select the tags with the highest scores as your 'true tag list'.

I tested it, and the outputs shown in the code above are the actual results. It classified everything correctly :)

I tried it on other use cases and it's not 100% accurate, but it's pretty good given that the code is super simple and you don't have to train a model yourself. Here are details on the theory, if you are interested.
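To turn the scores into an actual tag list, one option is a simple cutoff. This is only a minimal sketch: the helper name `select_tags` and the threshold of 0.5 are my own assumptions, not part of the Transformers library, and the input is just the dictionary returned by the multi-label call above.

```python
def select_tags(result, threshold=0.5):
    """Return the labels whose score is at least `threshold`.

    `result` is a dict with parallel "labels" and "scores" lists,
    as returned by the zero-shot pipeline above.
    The 0.5 default is an arbitrary assumption; tune it on your data.
    """
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


# Output of the multi-label example above, copied verbatim
result = {
    "sequence": "I like mangos and gorillas",
    "labels": ["animals", "fruits", "diseases"],
    "scores": [0.9978452920913696, 0.989518404006958, 0.00015786082076374441],
}

print(select_tags(result))  # ['animals', 'fruits']
```

A cutoff like this only makes sense with multi-label scoring. In the single-label mode the scores sum to 1, so you would normally just take the first label.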

Upvotes: 1

polm23

Reputation: 15593

What you want to do is called multi-label classification. Your "tags" are labels, and each document can have more than one of them.

A typical way to implement this is to train a binary classifier for each label, and then consider the labels that are above a threshold in their predictions to be positive.
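As a rough illustration of the one-binary-classifier-per-label idea, here is a sketch using scikit-learn rather than spaCy. The training titles are made up, and the TF-IDF features, logistic regression, the `predict_tags` helper, and the 0.5 threshold are all assumptions chosen for brevity, not a recommendation for your data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy, made-up training data: each title has a list of tags
titles = [
    "The biggest elephant in the world",
    "Gorillas live in the forest",
    "Fresh mangos and apples for sale",
    "Bananas are rich in potassium",
    "Flu symptoms and treatment",
    "How malaria spreads",
    "I like mangos and gorillas",
    "Monkeys eating bananas",
]
tags = [
    ["animals"],
    ["animals"],
    ["fruits"],
    ["fruits"],
    ["diseases"],
    ["diseases"],
    ["animals", "fruits"],
    ["animals", "fruits"],
]

# Turn the tag lists into a 0/1 matrix with one column per label
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tags)

# OneVsRestClassifier fits one independent binary classifier per column
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(titles, y)


def predict_tags(text, threshold=0.5):
    """Return every label whose predicted probability reaches `threshold`."""
    probabilities = model.predict_proba([text])[0]
    return [
        label
        for label, p in zip(binarizer.classes_, probabilities)
        if p >= threshold
    ]


print(predict_tags("A gorilla eating a mango", threshold=0.3))
```

With this little data the probabilities will be unreliable, so treat the threshold as something to tune on a held-out set of labelled titles.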

spaCy supports multi-label classification. It's not called out in the tutorial for textcat, but you can add more classes than POS and NEG and it should be able to learn them.

Upvotes: 1
