Reputation: 514
I am using a regression model to predict numeric values from a set of 120 attributes. Seven of these attributes are Categorical, and the largest of them has about 90,000 unique values. I am training with approximately 1 million rows of data.
However, when I look at the Categorical attributes in the datasource summary, I can see that these show a maximum of 5,000 unique values. Is this some kind of limit that AWS Machine Learning enforces, which could be affecting the accuracy of my model, or is it just a limitation of the summary display?
Also, I have highlighted the "Most frequent categories" results, where blank is shown as the most common value. (This could be because my CSV includes quotes, making the empty string a valid value.) Does AWS ML ignore blank entries for categorical attributes? Or should I be populating missing categorical values with UUIDs/random strings so that a common shared 'blank' value doesn't skew predictions?
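To make the two options concrete, here is a minimal pandas sketch; the file name and column name are hypothetical, and I don't know whether AWS ML expects either approach, so this only shows the mechanics:

    import uuid
    import pandas as pd

    df = pd.read_csv("training.csv")   # hypothetical file
    col = "my_category"                # hypothetical column

    # Treat NaN and empty/whitespace-only strings as missing
    missing = df[col].isna() | (df[col].astype(str).str.strip() == "")

    # Option A: one shared sentinel -- "missing" becomes its own category
    df.loc[missing, col] = "__MISSING__"

    # Option B (the UUID idea above): a unique value per row, so missing
    # entries never cluster into a single dominant category
    # df.loc[missing, col] = [str(uuid.uuid4()) for _ in range(missing.sum())]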
I understand that some ML models keep a spare neuron around for when new categorical values (previously unseen in training) are encountered at prediction time. Is this the case with AWS Machine Learning?
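I can't tell from the docs whether AWS ML does this internally, but the general technique the "spare neuron" refers to is mapping unseen values onto one shared token before encoding; a small sketch:

    import pandas as pd

    train = pd.Series(["red", "green", "blue"])
    new = pd.Series(["green", "purple"])   # "purple" was never seen in training

    known = set(train.unique())
    # Anything unseen collapses onto one shared token -- the tabular
    # analogue of reserving a "spare neuron" for unknown categories
    new = new.where(new.isin(known), "__UNSEEN__")
    print(new.tolist())   # ['green', '__UNSEEN__']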
I am an ML novice, so sorry if my questions are stupid or my methods/assumptions are wrong. I did scan the AWS documentation before asking.
Thanks.
Upvotes: 0
Views: 189
Reputation: 12939
It usually doesn't make much sense to use so many category values; only the top values will be used, as the smaller categories don't have much predictive power.
These categories have a very high correlation with the target, which is a bit suspicious. But if the model is working well with them, I wouldn't worry too much. You can try building the model without them to see if it makes any difference, but I wouldn't work too hard on selecting features; I'd focus more on adding potential new ones.
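If you want to test that intuition yourself, one simple preprocessing step (a sketch, not anything AWS ML does for you) is to keep only the most frequent values and fold the long tail into a single bucket before training:

    import pandas as pd

    df = pd.DataFrame({"cat": ["a"] * 5 + ["b"] * 3 + ["c", "d"]})

    # Keep the 2 most frequent values; fold everything else into "OTHER",
    # since rare values rarely carry enough signal on their own
    top = df["cat"].value_counts().nlargest(2).index
    df["cat"] = df["cat"].where(df["cat"].isin(top), "OTHER")
    print(df["cat"].value_counts())   # a: 5, b: 3, OTHER: 2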
Upvotes: 1