Reputation: 81
Here is my example:
I have a big store selling used cars. I want to write a program that can predict future car sales. I want to use an artificial neural network to analyze historical data and solve this problem. There are many years of sales history.
Network Input:
(Just make it simple.)
Network Output: Days on the market.
I ran into a problem very soon when I tried to design the neural network. The variables color, manufacturer, and transmission are different from the other 3 variables. Let's say there are 3 colors in total: white, black, and red; 3 manufacturers: Toyota, Ford, and Benz; and 3 transmissions: manual, auto, and CVT.
OK, since "color" is not a number, I cannot input the "color" variable as an integer. Inputting it as a string does not look like a good idea either. So I decided to give every color an "id": white is 0, black is 1, and red is 2. However, red is not twice black, and red is not closer to black than to white... The same problem applies to manufacturer and transmission.
How can I let the neural network know that this integer is an ID, not a continuous number or quantity? Some simple code would be appreciated.
Upvotes: 2
Views: 47
Reputation: 66805
These are what we call categorical variables, and one typical method that avoids the problem you described (red is not twice black) is one-hot encoding: a variable with K possible values is encoded as a K-bit binary vector with exactly one bit set, like:
v = {red, black, white}
leads to
red -> [1 0 0]
black -> [0 1 0]
white -> [0 0 1]
and so on. So you end up with binary, logical features such as "is this object red?", one per possible value.
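Since you asked for some simple code, here is a minimal sketch of the encoding in plain Python. The category lists come from your example; the helper names `one_hot` and `encode_car` are made up for illustration, and the order of values inside each list is an arbitrary choice (it only has to stay the same between training and prediction).

```python
# Category lists from the question; the order within each list is arbitrary
# but must stay fixed once the network has been trained on it.
COLORS = ["red", "black", "white"]
MANUFACTURERS = ["Toyota", "Ford", "Benz"]
TRANSMISSIONS = ["manual", "auto", "CVT"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_car(color, manufacturer, transmission):
    """Concatenate the one-hot vectors of the three categorical variables."""
    return (
        one_hot(color, COLORS)
        + one_hot(manufacturer, MANUFACTURERS)
        + one_hot(transmission, TRANSMISSIONS)
    )


features = encode_car("red", "Ford", "CVT")
print(features)  # [1, 0, 0, 0, 1, 0, 0, 0, 1]
```

The three categorical variables become 9 network inputs here (3 + 3 + 3), and your numeric variables can be appended to the same list as ordinary inputs.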
Upvotes: 3