Error in classifying SVM texts

Question

I'm trying to apply a text classification algorithm, but I get an error:

import sklearn
import numpy as np
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd
import pandas

dataset = pd.read_csv('train.csv', encoding = 'utf-8')
data = dataset['data']
labels = dataset['label']

X_train, X_test, y_train, y_test = train_test_split (data.data, labels.target, test_size = 0.2, random_state = 0)


vecteur = CountVectorizer()
X_train_counts = vecteur.fit_transform(X_train)

tfidf = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)

#SVM
clf = svm.SVC(kernel = 'linear', C = 10).fit(X_train, y_train)
print(clf.score(X_test, y_test))

I have the following error:

Traceback (most recent call last):
  File "bayes_classif.py", line 22, in <module>
    dataset = pd.read_csv('train.csv', encoding = 'utf-8')
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 72, saw 3

My data

data, label
bought noon  provence  shop givors moment  bad surprise  made account price  catalog expect part minimum refund difference wait read brief delay, refund

parcel ordered friend n still not arrive possible destination send back pay pretty unhappy act gift birth  status parcel n not moved weird think lost stolen share quickly solutions can send gift both time good , call

ordered  coat recovered calais city europe shops n not used assemble parties up  thing done  bad surprise parties not aligned correctly can see photo can exchange made refund man, annulation

note  important traces rust articles come to buy acting carrying elements going outside extremely disappointed wish to return together immediately full refund indicate procedure sabrina beillevaire , refund

note  important traces rust articles come to buy acts acting bearing elements going outside extremely disappointed wish to return together immediately full refund indicate procedure , annulation

request refund box jewelry arrived completely broken box n not protected free delivery directly packaging plastic item fragile cardboard box  interior shot cover cardboard torn corners  completely broken, call

ysearka · Accepted Answer

Can you try to reproduce the error with cleaner code? Yours contains a few mistakes and some unnecessary lines. We also need a sample of your data that reproduces the error, otherwise we won't be able to help.

Here is what I assume you are trying to do. Please run it with your data and tell us whether you still get the same error:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

dataset = pd.DataFrame({'data':['A first sentence','And a second sentence','Another one','Yet another line','And a last one'],
                    'label':[1,0,0,1,1]})
data = dataset['data']
labels = dataset['label']

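X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.2, random_state = 0)

vecteur = CountVectorizer()
tfidf = TfidfTransformer()

# Fit the vectorizer and the tf-idf weights on the training set only
X_train_counts = vecteur.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
# Reuse the fitted transformers on the test set (transform, not fit_transform)
X_test_tfidf = tfidf.transform(vecteur.transform(X_test))

# Train and score the SVM on the tf-idf features, not on the raw text
clf = svm.SVC(kernel = 'linear', C = 10).fit(X_train_tfidf, y_train)
print(clf.score(X_test_tfidf, y_test))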
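
As a side note, since the question already imports Pipeline: the same steps can be chained so that the test set goes through the fitted transformers automatically. This is only a minimal sketch on the toy data above, not something the original code requires. It assumes TfidfVectorizer is an acceptable stand-in for CountVectorizer followed by TfidfTransformer, and the names model and accuracy are made up for the illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Same toy data as in the snippet above
dataset = pd.DataFrame({
    'data': ['A first sentence', 'And a second sentence', 'Another one',
             'Yet another line', 'And a last one'],
    'label': [1, 0, 0, 1, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    dataset['data'], dataset['label'], test_size=0.2, random_state=0)

# TfidfVectorizer does the counting and the tf-idf weighting in one step;
# the pipeline fits it on the training text only and reuses it at scoring time.
model = make_pipeline(TfidfVectorizer(), SVC(kernel='linear', C=10))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

With only one test sentence the score can only be 0.0 or 1.0, so it says nothing about real performance; the point is just that raw text goes in and the pipeline handles the vectorizing.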

EDIT:

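Judging from your data, the error is probably caused by an extra comma inside one of the rows of your CSV file: pandas expects 2 fields per line and finds 3, so the parser raises an error. You can tell pandas to skip such rows with the error_bad_lines=False argument of read_csv (replaced by on_bad_lines='skip' from pandas 1.3 on, and removed in pandas 2.0). Here is a short example: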

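import io
import pandas as pd

temp = u"""data, label
A first working line, refund
a second ok line, call
last line with an inside comma: , character which makes it bug, call"""
# The last row has 3 fields instead of 2, so it is skipped instead of raising ParserError.
# pandas >= 1.3: on_bad_lines='skip'; on older versions use error_bad_lines=False instead.
df = pd.read_csv(io.StringIO(temp), on_bad_lines='skip')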
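
Skipping the bad rows throws away training examples. If you would rather keep them, one possible alternative (not part of the original answer) is to read the file yourself and split each row on its last comma only. The sketch below assumes that the label is always the last field, that labels never contain a comma, and that the file has no quoted fields. The names raw, bad_lines and rows are made up for the illustration, and with a real file you would read the lines from train.csv instead of a string.

```python
import pandas as pd

# Stand-in for the contents of the CSV file; the last row has an extra comma
raw = """data, label
A first working line, refund
a second ok line, call
last line with an inside comma: , character which makes it bug, call"""

# Ignore blank lines, then separate the header from the data rows
lines = [line for line in raw.splitlines() if line.strip()]
header, body = lines[0], lines[1:]

# 1-based positions (header = 1) of rows whose comma count differs from the header's
bad_lines = [n for n, line in enumerate(lines, start=1)
             if line.count(',') != header.count(',')]
print(bad_lines)

# Split on the LAST comma only, so commas inside the text stay in the 'data' column
rows = [[part.strip() for part in line.rsplit(',', 1)] for line in body]
df = pd.DataFrame(rows, columns=['data', 'label'])
print(df)
```

The bad_lines list is only there to help you inspect the offending rows; the DataFrame is built from every row either way.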
