Yosr_mzg

Reputation: 13

Categorize large text using machine learning

I have a large .xls document in which each row contains a problem ID, its description, and its category, for example: category 1-A-a1.

I am trying to build a machine learning model that classifies these entries by category. The goal is that, for each new entry (i.e. a new problem description), the model will be able to determine its category.

Constraints: I have more than 10 categories, and they are hierarchical (category 1 has several subcategories, and each subcategory has several sub-subcategories). I am considering hierarchical classification or multiclass classification, but I can't tell which is appropriate.

The description feature is a long text. I am thinking of multinomial logistic regression, but I read that it needs numerical data. Do I have to build a dictionary of all the words used in the document to give each one a numerical value? Is this the right choice?

I also want a score per category for each new entry, so that I can rank the nearest classes (e.g. for a new description X, category 2-B-b1 gets an 80% score).

Upvotes: 1

Views: 962

Answers (2)

Alain T.

Reputation: 42143

One of the strategies you could use is a Bernoulli Naive Bayes classifier (https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

It relies on a simple formula that lets you reduce the problem to lists of word frequencies per category.

Once you have established a meaningful baseline of word frequencies for texts that are known to be in the appropriate categories, the formula will be able to return a probability of match in each category for a new text.

This can give a very large words x categories matrix, but the processing of each element is very simple. Depending on your volumes and performance requirements, the formula can be optimized to limit calculations to the words that are actually present in the text being categorized, skipping the factors for words that have been seen before but are not present in that text (I could elaborate on that if the Bernoulli classifier is relevant to your solution). Note that there may be existing implementations of the classifier in Python (I haven't checked).
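The answer does not spell out the optimization, so the following is only one possible reading of it, sketched in plain Python. It assumes whitespace tokenization, add-one (Laplace) smoothing, and categories flattened to strings such as "1-A-a1"; the names tokenize, train and classify are hypothetical. For each category it precomputes the log-probability that every known word is absent, then corrects that baseline only for the words actually present in the new text.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase + whitespace split; Bernoulli NB only needs presence.
    return set(text.lower().split())

def train(samples):
    """samples: iterable of (category, description) pairs."""
    doc_count = Counter()             # number of documents per category
    word_docs = defaultdict(Counter)  # category -> word -> documents containing it
    vocab = set()
    for category, text in samples:
        words = tokenize(text)
        doc_count[category] += 1
        word_docs[category].update(words)
        vocab |= words
    total = sum(doc_count.values())
    model = {}
    for c, n in doc_count.items():
        # Smoothed P(word present | category); always strictly between 0 and 1.
        p = {w: (word_docs[c][w] + 1) / (n + 2) for w in vocab}
        # Baseline: log-probability that every known word is absent.
        all_absent = sum(math.log(1 - pw) for pw in p.values())
        model[c] = (math.log(n / total), p, all_absent)
    return model

def classify(model, text):
    """Return {category: probability} for a new description."""
    words = tokenize(text)
    scores = {}
    for c, (log_prior, p, all_absent) in model.items():
        score = log_prior + all_absent
        for w in words:
            if w in p:  # words never seen in training are ignored
                # Swap the "absent" factor for the "present" factor.
                score += math.log(p[w]) - math.log(1 - p[w])
        scores[c] = score
    # Normalize the log-scores into probabilities that sum to 1.
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

samples = [
    ("1-A-a1", "printer does not print"),
    ("1-A-a1", "printer paper jam"),
    ("2-B-b1", "cannot login password expired"),
    ("2-B-b1", "password reset email missing"),
]
model = train(samples)
result = classify(model, "printer jam again")
```

Here result maps each category to a score between 0 and 1, which is the kind of per-category score the question asks for. With this approach, classifying a new text costs work proportional to the length of that text rather than to the size of the vocabulary.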

Upvotes: 1

Nishant Ravindran

Reputation: 54

A good approach would be to load your .xls file into a pandas DataFrame and use fastText (https://fasttext.cc/) to build a text classification model; any new text would then be classified into its respective category. Refer to https://github.com/facebookresearch/fastText for proper documentation.
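As a rough sketch of that workflow, the code below turns (category, description) rows into the one-example-per-line "__label__" format that fastText's supervised mode reads. The helper names are hypothetical, and the rows are assumed to come from something like pandas.read_excel with columns named "category" and "description" (an assumption about your file). The training step uses fasttext.train_supervised and model.predict from the fastText Python bindings and has not been tested here against any particular fastText version.

```python
def to_fasttext_line(category, description):
    # Labels carry the "__label__" prefix and must not contain spaces,
    # so "1- A- a1" becomes "__label__1-A-a1".
    label = "__label__" + category.replace(" ", "")
    # Collapse newlines and repeated whitespace: one example per line.
    text = " ".join(description.split())
    return f"{label} {text}"

def write_training_file(rows, path):
    """rows: iterable of (category, description) pairs, e.g. built from
    df[["category", "description"]].itertuples(index=False) if the data
    was loaded with pandas (column names are an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        for category, description in rows:
            f.write(to_fasttext_line(category, description) + "\n")

def train_and_predict(path, new_text, k=3):
    # Third-party dependency, imported lazily: pip install fasttext
    import fasttext
    model = fasttext.train_supervised(input=path)
    # Returns the k most likely labels with their probabilities.
    labels, probs = model.predict(new_text, k=k)
    return list(zip(labels, probs))

example = to_fasttext_line("1- A- a1", "Printer\n does not  print")
```

Treating each full path such as 1-A-a1 as a single flat label is the simplest option; it ignores the hierarchy, which may or may not be acceptable for your data.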

Upvotes: 1
