Changing parameters in scikit-learn classifier results in UnicodeDecodeError

Question

I have a CSV file with hundreds of thousands of lines. One column holds free text containing Scandinavian characters (among others), and I'm fitting scikit-learn classifiers to predict a True/False label, given in another column, for each piece of text.

I am using this example as a starting point: http://scikit-learn.org/0.15/auto_examples/grid_search_text_feature_extraction.html

At first the only thing I changed was the data, and training and testing ran fine with useful results.

However, I also want to try the liblinear-based LinearSVC classifier, as it might produce better results in some cases. Changing nothing but the classifier to LinearSVC, or alternatively keeping SGDClassifier as in the example but changing its loss function from the default hinge to squared_hinge, results in

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 312: ordinal not in range(128)

I assume this error must arise from the input CSV, but I cannot understand why the original example from the URL runs smoothly while changing only the classifier properties, with the exact same input data, produces this error. Any ideas why this might be?

Secondly, I am not familiar with reading Python stack traces and would appreciate any help on how to debug the error and trace it down to the problematic byte. The stack trace is the following:

Traceback (most recent call last):
  File "", line 48, in 
    grid_search.fit(data_train.kuvaus, data_train.loukkaantuneita)
  File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\model_selection\_search.py", line 945, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\model_selection\_search.py", line 564, in _fit
    for parameters in parameter_iterable
  File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 768, in __call__
    self.retrieve()
  File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 696, in retrieve
    stack_start=1)
  File "C:\Users\x\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\externals\joblib\format_stack.py", line 417, in format_outer_frames
    return '\n'.join(format_records(output[stack_end:stack_start:-1]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 312: ordinal not in range(128)

data_train is a pandas DataFrame with a True/False column, data_train.loukkaantuneita, and a free-text column, data_train.kuvaus, which is supposed to be UTF-8.
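
To illustrate what I mean by tracing the error down to a byte, here is a minimal sketch of the kind of check I have in mind. The helper name find_decode_failures and the sample strings are made up for illustration, and the sketch assumes the column values are raw byte strings. It only reports where a value fails to decode with a given codec; it does not show that the text column is the actual source of the error above.

```python
def find_decode_failures(values, encoding="ascii"):
    """Return (row, byte_offset, byte_value) for each byte string that
    cannot be decoded with `encoding`. Hypothetical helper, not part of
    scikit-learn or pandas."""
    failures = []
    for row, value in enumerate(values):
        if not isinstance(value, bytes):
            continue  # already decoded text (or a missing value): skip it
        try:
            value.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte
            failures.append((row, err.start, bytearray(value)[err.start]))
    return failures


# Made-up sample values standing in for the free-text column.
samples = [
    b"plain ascii text",
    u"already decoded \u00e4",
    u"ty\u00f6tapaturma".encode("utf-8"),
]

for row, offset, byte in find_decode_failures(samples):
    print("row %d: byte 0x%02x at offset %d" % (row, byte, offset))
```

Passing encoding="utf-8" instead would check whether the values really are valid UTF-8, and the same function could be pointed at list(data_train.kuvaus).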

