Reputation: 1153
I'm using text classification to classify dialects. However, I noticed that I have to use CountVectorizer like so:
from nltk.corpus import stopwords  # provides stopwords.words('arabic')
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=200, min_df=2, max_df=0.7, stop_words=stopwords.words('arabic'))
X = vectorizer.fit_transform(X).toarray()
What happens is that I have to make a new text file for every line in my CSV file. I have collected 1000 labeled tweets from Twitter, and I have them all in a single CSV file.
I have 2 questions:
Upvotes: 1
Views: 972
Reputation: 2120
No, you don't have to put every line in a separate text file. If you look at the example in the official sklearn documentation https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html , you will see how to do it. To follow that example, convert the tweet column of your CSV from a dataframe column to a list and pass that list to the vectorizer the same way the documentation example does.
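As a rough sketch of that idea, the snippet below loads a CSV with pandas, turns the text column into a list, and passes it to CountVectorizer. The column names `tweet` and `label`, the sample rows, and the in-memory CSV are assumptions made for illustration; with a real file you would call `pd.read_csv` on its path instead.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the real CSV file; column names "tweet" and "label" are assumed.
csv_text = """tweet,label
good morning everyone,A
good evening everyone,B
morning coffee is good,A
"""

# With a real file this would be pd.read_csv("path/to/your.csv").
df = pd.read_csv(io.StringIO(csv_text))

# One list entry per CSV row; no separate text files are needed.
tweets = df["tweet"].tolist()
labels = df["label"].tolist()

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)  # sparse matrix: one row per tweet

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The options from the question (`max_features`, `min_df`, `max_df`, `stop_words`) can be passed to `CountVectorizer` here in the same way.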
No, you don't have to use CountVectorizer. CountVectorizer produces a bag-of-words count representation, and there are several other methods of converting text to vectors for classification, such as TF-IDF and Word2Vec. For your case, I believe TF-IDF or Word2Vec will work fine.
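For the TF-IDF option, a minimal sketch could look like the following. It uses sklearn's `TfidfVectorizer`, which takes a list of strings just like `CountVectorizer`; the sample tweets are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample data; in practice this would be the list of tweets from the CSV.
tweets = [
    "good morning everyone",
    "good evening everyone",
    "morning coffee is good",
]

# Same interface as CountVectorizer, but counts are reweighted by inverse
# document frequency, and each row is L2-normalized by default.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The resulting matrix can be passed to a classifier in the same way as the CountVectorizer output.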
Upvotes: 1