Flavio

Reputation: 839

Training a Naive Bayes classifier inside a with open() statement takes too long

I have a csv file with 3483 lines and 460K characters and 65K words, and I'm trying to use this corpus to train a NaiveBayes classifier in Scikit-learn.

The problem is that the statement below takes too long: it ran for 1 hour and did not finish.

from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier 
import csv 

with open('train.csv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="csv") 

Any idea what I am doing wrong?

Thanks in advance.

Upvotes: 0

Views: 321

Answers (2)

SalazarSid

Reputation: 64

I am not very familiar with the TextBlob library, but perhaps this will help.

I wrote the following code to train a multinomial Naive Bayes model on raw text data, after vectorizing and transforming the text in my dataset.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Import the dataset (column 0 holds the labels, column 1 holds the text)
url = "C:\\Users\\sidharth.m\\Desktop\\Project_sid_35352\\Final.csv"
documents = pd.read_csv(url)

array = documents.values
x = array[:, 1]  # text
y = array[:, 0]  # labels

# Turn the raw text into a sparse matrix of token counts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(x)

# Reweight the counts with TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model = MultinomialNB().fit(X_train_tfidf, y)

# Note: this accuracy is measured on the same data the model was trained on
predicted = model.predict(X_train_tfidf)
acc = accuracy_score(y, predicted)
print(acc)
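
The accuracy printed above is computed on the training data, so it tends to look better than it would on unseen text. A minimal sketch of a held-out evaluation follows. It assumes scikit-learn is installed; the function name evaluate_on_holdout and the toy sentences are made up for illustration and are not from the original dataset. TfidfVectorizer is used here as a shorthand for CountVectorizer followed by TfidfTransformer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate_on_holdout(texts, labels, test_size=0.25, seed=0):
    """Fit TF-IDF + MultinomialNB on one split and score it on the other."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # The pipeline fits the vectorizer on the training split only,
    # then reuses that vocabulary when transforming the test split.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(x_train, y_train)
    return model, accuracy_score(y_test, model.predict(x_test))


# Toy data, invented for illustration only.
texts = ["great food", "lovely place", "great service", "lovely food",
         "awful food", "terrible place", "awful service", "terrible food"]
labels = ["pos"] * 4 + ["neg"] * 4

model, acc = evaluate_on_holdout(texts, labels)
print(acc)
```

With the real data, texts and labels would be the x and y arrays from the code above.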

Upvotes: 0

Flavio

Reputation: 839

There's a problem with this lib.

It's documented in the following links:

https://github.com/sloria/TextBlob/pull/136

https://github.com/sloria/TextBlob/issues/77

Short story: the library does not deal well with large datasets.
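
One way to check whether dataset size is the bottleneck is to train on a small slice of the file first and increase it gradually. The sketch below is only an illustration: the helper names load_rows and train_on_sample are hypothetical, and it assumes train.csv has no header row and holds the text in the first column and the label in the second. It relies on NaiveBayesClassifier accepting a list of (text, label) tuples instead of a file object.

```python
import csv


def load_rows(path, limit=None):
    """Read (text, label) pairs from the first `limit` lines of a two-column CSV."""
    rows = []
    with open(path, newline="", encoding="utf-8") as fp:
        for i, record in enumerate(csv.reader(fp)):
            if limit is not None and i >= limit:
                break
            # Skip blank or malformed lines that lack a label column.
            if len(record) >= 2:
                rows.append((record[0], record[1]))
    return rows


def train_on_sample(path, limit=200):
    """Train TextBlob's classifier on the first `limit` lines only."""
    # Imported here so load_rows stays usable without TextBlob installed.
    from textblob.classifiers import NaiveBayesClassifier
    return NaiveBayesClassifier(load_rows(path, limit))
```

Calling train_on_sample('train.csv', limit=200) and then timing larger limits should show how quickly training time grows with the number of rows.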

Upvotes: 1
