In scikit learn, how to deal with the data mixed with numerical and nominal value?

Question

I know that the computation in scikit-learn is based on NumPy so everything is a matrix or array.

How does this package handle mixed data (numerical and nominal values)?

For example, a product could have the attribute 'color' and 'price', where color is nominal and price is numerical. I notice there is a model called 'DictVectorizer' to numerate the nominal data. For example, two products are:

products = [{'color':'black','price':10}, {'color':'green','price':5}]

And the result from 'DictVectorizer' could be:

[[1,0,10],
 [0,1,5]]

If there are lots of different values for the attribute 'color', the matrix would be very sparse. And long features will degrade the performance of some algorithms, such as decision trees.

Is there any way to use the nominal value without the need to create dummy codes?

In scikit learn, how to deal with the data mixed with numerical and nominal value?

Answers (1)

Related Questions