mart1n
mart1n

Reputation: 6213

scikit-learn: classifying texts using custom labels

I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).

I would like to try doing this with scikit-learn but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words. Some that should pretty much guarantee that a particular text is classed as "positive" while others guarantee that it's classed as "negative". And lastly, is there a way to imply that certain parts of the analyzed text are more valuable than others?

Any pointers to existing tutorials or docs appreciated!

Upvotes: 0

Views: 687

Answers (2)

sansingh
sansingh

Reputation: 195

There are many ways to do this like Tf-Idf (Term frequency - Inverse Document Frequency), Count Vectorizer, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Word2Vec.

Among all of above mentioned methods, Word2Vec is the best method. You can use a pre-trained model by Google for Word2Vec, available on:

https://github.com/mmihaltz/word2vec-GoogleNews-vectors

Upvotes: 0

Alex
Alex

Reputation: 12913

There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.

He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:

  1. Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

  2. Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

  3. Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Upvotes: 1

Related Questions