Using scikit-learn to train on multidimensional data

Question

It's a very basic concept: I have more than one input (feature) to train on. My data is all text, and each example has three separate fields. Every example I have been able to find has text data set up like this:

data = ['text1','text2',...]

where mine looks like:

data = [['text1','text2','text3'],[...],...]

but when I try to fit to the data I get the following traceback:

ValueError                                Traceback (most recent call last)
 in ()
----> 1 classifier.fit(X,y)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight)
    140                              "by not using the ``sparse`` parameter")
    141 
--> 142         X = atleast2d_or_csr(X, dtype=np.float64, order='C')
    143 
    144         if self.impl in ['c_svc', 'nu_svc']:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.pyc in atleast2d_or_csr(X, dtype, order, copy)
    114     """
    115     return _atleast2d_or_sparse(X, dtype, order, copy, sparse.csr_matrix,
--> 116                                 "tocsr")
    117 
    118 

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _atleast2d_or_sparse(X, dtype, order, copy, sparse_class, convmethod)
     94         _assert_all_finite(X.data)
     95     else:
---> 96         X = array2d(X, dtype=dtype, order=order, copy=copy)
     97         _assert_all_finite(X)
     98     return X

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    318 
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321 
    322 def asanyarray(a, dtype=None, order=None):

ValueError: setting an array element with a sequence.

Is there a specific way I have to approach this? Thank you!

NOTES:

All of the text data I am using is vectorized with a HashingVectorizer.

The call is classifier.fit(X,y), where X is a list of lists that each contain 3 vectorized texts, and y is a list of the categories that the corresponding elements of X belong to.

ojy · Accepted Answer

X has to be a two-dimensional array (or a list of lists, if you prefer). Each inner list has to contain only numeric values, and all of these lists must have the same length, like this: [[1,2,3,5],[3,4,5,6],[6,7,8,9],...]. If you have several text entries per object and vectorize each of them, you need to combine the resulting vectors into a single list, for example by concatenating them, if that makes sense in your context. In the end, each object has to be represented by a single list in which every entry is numeric, and all objects must be represented by lists of equal length, where corresponding elements represent the same feature (e.g. the frequency of the same token in your texts). Let me know whether this makes sense.
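
A minimal sketch of the concatenation idea, assuming a recent scikit-learn and SciPy. The toy data, the labels, n_features=2**10 and the linear-kernel SVC are assumptions for illustration, not taken from the question. Since HashingVectorizer returns sparse matrices, the per-field results are joined side by side with scipy.sparse.hstack rather than as Python lists:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import SVC

# Hypothetical toy data: every example has three text fields.
data = [
    ["cheap pills", "buy now", "limited offer"],
    ["meeting notes", "see agenda", "project update"],
    ["win money", "click here", "free prize"],
    ["lunch plans", "see you at noon", "team outing"],
]
y = ["spam", "ham", "spam", "ham"]

# HashingVectorizer is stateless, so one instance can serve all fields.
vectorizer = HashingVectorizer(n_features=2**10)

# Regroup the data by field: three tuples, each holding one text per example.
fields = list(zip(*data))

# One sparse (n_examples x n_features) matrix per field.
blocks = [vectorizer.transform(field) for field in fields]

# Concatenate the blocks column-wise so each example becomes a single
# numeric row of length 3 * n_features.
X = hstack(blocks).tocsr()

classifier = SVC(kernel="linear")
classifier.fit(X, y)

print(X.shape)
```

Each field keeps its own block of columns, so the same token appearing in different fields is treated as a different feature. Whether that is desirable depends on the data; joining the three strings into one text before vectorizing is the alternative when the fields are interchangeable.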
