Aaditya Ura

Reputation: 12669

How to use multiple text features for NLP classifier?

I am trying to build a text classifier. Usually there is one text column plus a ground-truth label, but I am working on a problem where the dataset contains several text features, and I am exploring different ways to make use of them.

For example, my dataset looks like this:

Index_no  domain   comment_by  comment                      research_paper                                                 books_name
01        Science  Professor   Thesis needs more work       Evolution of Quiescent Galaxies as a Function of Stellar Mass  MOIRCS Deep Survey
02        Math     Professor   Doesn't follow Latex format  Evolution of Quiescent Galaxies as a Function of Stellar Mass  nonlinear dispersive equations

This is just a dummy dataset. Here my ground truth (Y) is domain, and the features are comment_by, comment, research_paper and books_name.

If I use any NLP model (RNN-LSTM, Transformers, etc.), it usually takes a single 3-dimensional input tensor. That works when I have one text column, but how do I feed many text features into a text classifier?

What I've tried:

1) Joining all columns into one long string

Professor Thesis needs more work Evolution of Quiescent Galaxies as a Function of Stellar Mass MOIRCS Deep Survey

2) Using a marker token between columns

<CB> Professor <C> Thesis needs more work <R> Evolution of Quiescent Galaxies as a Function of Stellar Mass <B> MOIRCS Deep Survey 

where <CB> = comment_by, <C> = comment, <R> = research_paper, <B> = books_name

Should I put <CB> at the beginning, or use numbered separators like this?

Professor <1> Thesis needs more work <2> Evolution of Quiescent Galaxies as a Function of Stellar Mass <3> MOIRCS Deep Survey

3) Using a different dense layer (or embedding) for each column and concatenating the outputs.
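
For reference, approaches 2 and 3 can be sketched without any ML framework. This is only a minimal illustration, not a real model. The column names and marker tokens come from the example above. The hashed bag-of-words encoder, the function names and the dimension of 8 are assumptions standing in for whatever learned embedding or dense layer each column would get in practice.

```python
import zlib

# Column names and marker tokens are taken from the example in the question.
MARKERS = {
    "comment_by": "<CB>",
    "comment": "<C>",
    "research_paper": "<R>",
    "books_name": "<B>",
}

def join_with_markers(row):
    """Approach 2: one string, each column prefixed by its own marker token."""
    return " ".join(f"{marker} {row[col]}" for col, marker in MARKERS.items())

def hashed_bow(text, dim=8):
    """Hypothetical stand-in encoder: fixed-size bag-of-words via the hashing trick.

    zlib.crc32 is used instead of hash() so the result is the same on every run.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec

def encode_per_column(row, dim=8):
    """Approach 3: encode every column separately, then concatenate the vectors."""
    features = []
    for col in MARKERS:
        features.extend(hashed_bow(row[col], dim))
    return features

row = {
    "comment_by": "Professor",
    "comment": "Thesis needs more work",
    "research_paper": "Evolution of Quiescent Galaxies as a Function of Stellar Mass",
    "books_name": "MOIRCS Deep Survey",
}

tagged = join_with_markers(row)
features = encode_per_column(row)
print(tagged)
print(len(features))  # 4 columns * 8 dims = 32
```

With approach 2 the model sees one sequence and has to learn what the markers mean. With approach 3 each column keeps its own slice of the feature vector, so a short column such as comment_by is not drowned out by a long one such as research_paper.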

I've tried all three approaches. Is there any other approach I can try to improve the model's accuracy, or a better way to extract, combine, or join the features?

Thanks in advance!

Upvotes: 7

Views: 1986

Answers (1)

spectre

Reputation: 767

Here are some of the things you could try:

1.) Combine research_paper, books_name and comment into one string.

2.) Treat comment_by as a categorical variable and encode it using a one-hot encoder or a label encoder.

3.) Whatever model you are using, tune the hyperparameters to improve the results.
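
A rough sketch of points 1 and 2 in plain Python, assuming each row is a dict keyed by the column names from the question. The function name build_inputs is made up for illustration, and the third row (with comment_by "Student") is hypothetical, added only so that the one-hot vector has more than one category.

```python
def build_inputs(rows):
    """Return (texts, role_vectors, roles) for a list of row dicts."""
    text_cols = ("research_paper", "books_name", "comment")
    # Fixed, sorted category list so the one-hot positions are reproducible.
    roles = sorted({r["comment_by"] for r in rows})
    role_index = {role: i for i, role in enumerate(roles)}

    texts, role_vectors = [], []
    for r in rows:
        # Point 1: the three free-text columns become one string.
        texts.append(" ".join(r[c] for c in text_cols))
        # Point 2: comment_by becomes a one-hot vector.
        one_hot = [0] * len(roles)
        one_hot[role_index[r["comment_by"]]] = 1
        role_vectors.append(one_hot)
    return texts, role_vectors, roles

paper = "Evolution of Quiescent Galaxies as a Function of Stellar Mass"
rows = [
    {"comment_by": "Professor", "comment": "Thesis needs more work",
     "research_paper": paper, "books_name": "MOIRCS Deep Survey"},
    {"comment_by": "Professor", "comment": "Doesn't follow Latex format",
     "research_paper": paper, "books_name": "nonlinear dispersive equations"},
    # Hypothetical row, not from the question's dataset.
    {"comment_by": "Student", "comment": "Needs more references",
     "research_paper": paper, "books_name": "MOIRCS Deep Survey"},
]

texts, role_vectors, roles = build_inputs(rows)
print(roles)
print(role_vectors)
print(texts[0])
```

The combined text would then go through the text encoder, and the one-hot vector would be concatenated to the encoder output before the final classification layer.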

Do let me know what results you get!

Upvotes: 1
