Reputation: 880
I'm trying to classify product items in order to predict their category based on the product title and their base price.
An example (product title, price, category):
['notebook sony vaio vgn-z770td dockstation', 3000.0, u'MLA54559']
Previously I was only using product title for the prediction task but I'd like to include the price to see if the accuracy improves.
The problem with my code is that I can't merge the text and numeric features. I've been reading some questions here on SO, and this is my code excerpt:
#extracting features from text
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform([e[0] for e in training_set])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
#extracting numerical features
X_train_price = np.array([e[1] for e in training_set])
X = sparse.hstack([X_train_tfidf, X_train_price]) #this is where the problem begins
clf = svm.LinearSVC().fit(X, [e[2] for e in training_set])
I try to merge the data types with sparse.hstack but I get the following error:
ValueError: blocks[0,:] has incompatible row dimensions
I guess the problem lies in X_train_price (a list of prices), but I don't know how to format it for the sparse function to work successfully.
These are the shapes of both arrays:
>>> X_train_tfidf.shape
(65845, 23136)
>>> X_train_price.shape
(65845,)
Upvotes: 2
Views: 1895
Reputation: 150977
It looks to me like this should be as simple as stacking the arrays. If scikit-learn follows the conventions I'm familiar with, then each row in X_train_tfidf is a training datapoint, and there are a total of 65845 points. So you just have to do an hstack, as you said you tried to do.
However, you need to make sure the dimensions are compatible! In vanilla numpy you get this error otherwise:
>>> a = numpy.arange(15).reshape(5, 3)
>>> b = numpy.arange(15, 20)
>>> numpy.hstack((a, b))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/shape_base.py", line 270, in hstack
    return _nx.concatenate(map(atleast_1d,tup),1)
ValueError: arrays must have same number of dimensions
Reshape b to have the correct dimensions, noting that a 1-d array of shape (5,) is totally different from a 2-d array of shape (5, 1).
>>> b
array([15, 16, 17, 18, 19])
>>> b.reshape(5, 1)
array([[15],
       [16],
       [17],
       [18],
       [19]])
>>> numpy.hstack((a, b.reshape(5, 1)))
array([[ 0,  1,  2, 15],
       [ 3,  4,  5, 16],
       [ 6,  7,  8, 17],
       [ 9, 10, 11, 18],
       [12, 13, 14, 19]])
So in your case, you want an array of shape (65845, 1) instead of (65845,). I might be missing something because you are using sparse arrays. Nonetheless, the principle ought to be the same. I have no idea what sparse format you're using based on the above code, so I just picked one to test:
>>> a = scipy.sparse.lil_matrix(numpy.arange(15).reshape(5, 3))
>>> scipy.sparse.hstack((a, b.reshape(5, 1))).toarray()
array([[ 0,  1,  2, 15],
       [ 3,  4,  5, 16],
       [ 6,  7,  8, 17],
       [ 9, 10, 11, 18],
       [12, 13, 14, 19]])
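Applied to the variables in the question, the same reshape might look like the sketch below. The small matrix is a made-up stand-in for X_train_tfidf (the real one would come from TfidfTransformer), the prices are invented, and price_column is just a name I picked. Passing format="csr" is my own choice, since hstack otherwise returns a COO matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for X_train_tfidf: 4 documents, 3 vocabulary terms.
X_train_tfidf = sparse.csr_matrix(np.array([
    [0.0, 0.7, 0.7],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.0, 0.8],
]))

# Invented prices as a 1-d array of shape (4,), mirroring the question.
X_train_price = np.array([3000.0, 150.0, 20.0, 999.0])

# Turn the prices into a single column of shape (4, 1) so that both
# blocks have the same number of rows; -1 lets numpy infer the row count.
price_column = X_train_price.reshape(-1, 1)

# Stack the text features and the price column side by side.
X = sparse.hstack([X_train_tfidf, price_column], format="csr")

print(X.shape)
```

With the real data, X should then have one more column than X_train_tfidf and can be passed to fit in place of the original hstack result.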
Upvotes: 4