Reputation: 839
I have a CSV file with 3,483 lines (about 460K characters and 65K words), and I'm trying to use this corpus to train a Naive Bayes classifier in scikit-learn.
The problem is that the statement below takes too long: it ran for 1 hour and did not finish.
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier
import csv
with open('train.csv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")
Any guesses as to what I am doing wrong?
Thanks in advance.
Upvotes: 0
Views: 321
Reputation: 64
I am not entirely sure about the TextBlob library, but perhaps this may help.
I wrote the following code to train a multinomial Naive Bayes model on raw text, after vectorizing the text in my dataset and applying a TF-IDF transform.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Import the dataset: column 0 holds the labels, column 1 the raw text
url = "C:\\Users\\sidharth.m\\Desktop\\Project_sid_35352\\Final.csv"
documents = pd.read_csv(url)
array = documents.values
x = array[:, 1]
y = array[:, 0]

# Turn the raw text into token counts, then into TF-IDF weights
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(x)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Train the model, then score it on the same data it was trained on
model = MultinomialNB().fit(X_train_tfidf, y)
predicted = model.predict(X_train_tfidf)
acc = accuracy_score(y, predicted)
print(acc)
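Note that the accuracy above is measured on the training data itself, so it will tend to look optimistic. A minimal sketch of the same idea with a held-out split might look like the following. The toy `texts` and `labels` lists are placeholders I made up for illustration, not the real dataset, and `TfidfVectorizer` is used here as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data; replace with the text and label columns
# of your own CSV file.
texts = [
    "I love this sandwich",
    "This is an amazing place",
    "I feel very good about these results",
    "What a great day",
    "I do not like this restaurant",
    "I am tired of this stuff",
    "I can't deal with this",
    "My boss is horrible",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# Hold out 25% of the rows so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline applies the vectorizer and the classifier in one step,
# so the test rows are transformed with the vocabulary learned in training.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predicted = model.predict(X_test)
acc = accuracy_score(y_test, predicted)
print(acc)
```

With only a handful of toy rows the printed score means very little; the point is only the shape of the train/test workflow.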
Upvotes: 0
Reputation: 839
There's a problem with this library.
It's documented in the following links:
https://github.com/sloria/TextBlob/pull/136
https://github.com/sloria/TextBlob/issues/77
Long story short: the library does not deal well with large datasets.
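One possible workaround is to train the same kind of model with scikit-learn directly. The sketch below assumes the CSV has no header and that each row is `text,label`, which is the layout TextBlob's CSV format expects; adjust the column indices if your file differs. The helper name `train_from_csv` is my own, not part of either library.

```python
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_from_csv(path):
    """Train a Naive Bayes text classifier from a header-less CSV file.

    Assumes each row is (text, label); this layout is an assumption.
    """
    texts, labels = [], []
    with open(path, newline="", encoding="utf-8") as fp:
        for row in csv.reader(fp):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            texts.append(row[0])
            labels.append(row[1])

    # Word counts are stored as a sparse matrix, so the feature set
    # does not grow into a dense documents-by-vocabulary table.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model


# Example usage (assumes train.csv exists in the working directory):
# cl = train_from_csv("train.csv")
# print(cl.predict(["some new text to classify"]))
```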
Upvotes: 1