Reputation: 85
My question is around a specific classification problem I have.
My training data is complete, with no missing values. I can build any classification model on it (SVM, random forest, etc.) and get good results. So far pigs are not flying in the troposphere.
The problem is I want to apply these models to data with missing features. I am not interested in imputation of any sort. I would like an "uncertainty" measure that goes up the more features are missing, while the model still spits out a result (even with high uncertainty). For example, for one record where 5 out of 10 features are missing, the model would still give a class, but with an uncertainty of 50% (ideally I could also specify how "important" each variable is).
I haven't run into anything similar online, and I've been looking for a while. Thanks for your help!
Upvotes: 1
Views: 743
Reputation: 66775
Let us start with a very simple model - a linear one, `f(x) = sign(<w,x> + b)`. Assume we are given a vector with a missing value `x_i = N/A`, and that the corresponding weight is non-zero (`w_i != 0`); without loss of generality let `w_i > 0`. Then I can always "impute" a value for `x_i` so small (very, very negative, like `-10e10000`) that the model answers `-1`, and symmetrically a value so large that it outputs `+1`. So in order to make a prediction (and further, to quantify its certainty) we need an assumption about the possible values of `x_i`. I hope this simple example shows that without any assumption we are lost - we can do nothing: no predictions, no certainty, nothing. This is a very well-known fact in machine learning - we cannot make a prediction without a model-induced bias. In this case, we model the missing values.
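To make the flip concrete, here is a tiny sketch (weights and values are made up for illustration) showing that "imputing" extreme values for a missing feature forces either class out of a linear model:

```python
import numpy as np

# Toy linear classifier f(x) = sign(<w,x> + b); hypothetical weights.
w = np.array([0.5, -1.2, 2.0])
b = 0.1

def predict(x):
    return 1 if np.dot(w, x) + b >= 0 else -1

# Suppose feature 2 (with w_2 = 2.0 > 0) is missing. "Impute" extremes:
x = np.array([1.0, 0.3, 0.0])          # 0.0 is a stand-in for the N/A slot
x_low, x_high = x.copy(), x.copy()
x_low[2]  = -1e10                      # very negative -> forces sign -1
x_high[2] =  1e10                      # very large    -> forces sign +1
print(predict(x_low), predict(x_high)) # -1 1
```

Without constraining what `x_i` may be, both answers are equally "valid" - hence the need for a distributional assumption.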
We need to agree on what the values can be. There are many options - for example, a uniform distribution over the feature's valid range, or the empirical distribution of that feature in the training set.
Ok, so how is this different from data imputation? Data imputation fills in a missing value - it gives you a single point. What I am talking about here is treating our point with missing values as a probability distribution - a more Bayesian approach. It is no longer a data point; it is an infinite cloud of points, with varying densities.
Unfortunately, here things become complex, as this is a completely model-specific question. Depending on what type of classifier/regressor you use, you need a different approach. The simplest case is Random Forest, so I will focus on that one, and later give a less efficient but more general solution that works for any model.
In decision trees, each node is a decision on some feature. So to make a prediction on our "distribution", we simply push the point through the decision process - if a node asks about an existing feature, we proceed normally. What do we do when asked about a missing feature? We split the execution and compute both paths, with weights derived from our distribution and the node's threshold. For example, assume we have chosen a uniform distribution on [0,1], and the threshold is 0.75 (meaning the node asks whether the missing value is <0.75 or >=0.75). We split the computation and check the prediction in both subtrees; the <0.75 branch gets weight 0.75 (the integral `INT_{0}^{0.75} pdf(x) dx`, where `pdf(x)` is our uniform density) and the other branch gets weight 0.25. At the end we take the expected value. What is our confidence? You compute it as usual, or perform a more complex analysis of confidence intervals.
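The weighted traversal above can be sketched as follows - the tree structure and feature values are invented for illustration, and missing features are assumed Uniform[0,1] so that `P(x < t) = t`:

```python
# A tiny hand-built decision tree; each internal node tests one feature
# against a threshold. Leaves carry class labels +1/-1.
tree = {
    "feature": 0, "threshold": 0.75,
    "left":  {"leaf": -1},                    # taken when x[0] < 0.75
    "right": {"feature": 1, "threshold": 0.5,
              "left": {"leaf": -1}, "right": {"leaf": 1}},
}

def expected_label(node, x):
    """Expected class label; None marks a missing feature, which is
    assumed Uniform[0,1], so the '< t' branch gets weight t."""
    if "leaf" in node:
        return node["leaf"]
    v = x[node["feature"]]
    if v is None:  # missing: follow both branches, weighted
        t = node["threshold"]
        return (t * expected_label(node["left"], x)
                + (1 - t) * expected_label(node["right"], x))
    branch = "left" if v < node["threshold"] else "right"
    return expected_label(node[branch], x)

# Feature 0 missing, feature 1 = 0.9 -> right subtree answers +1.
print(expected_label(tree, [None, 0.9]))  # 0.75*(-1) + 0.25*(+1) = -0.5
```

The sign of the expected value gives the class; its magnitude (how far from 0) can serve as a certainty measure.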
The most general method (which can be used with any model as a black box) is a Monte Carlo method. We have our distribution `pdf(x)`, so we repeatedly sample from it, classify, and record the output. At the end we gather the votes and obtain both a classification and a confidence. Pseudocode follows:
function CLASSIFY_WITH_NA(x, probability_model, model, k=1000):
    pdf <- probability_model(x.missing_values)
    predictions <- empty list
    for i = 1 to k:
        x' ~ pdf                    # sample values for the missing features
        prediction <- model(x')
        predictions.add(prediction)
    class <- most_common_element_of(predictions)
    confidence <- count(class, predictions) / k
    return class, confidence
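A minimal Python version of this pseudocode, assuming the missing-value distribution is supplied as a per-feature sampler function (that interface, and the toy model, are my own choices for the sketch):

```python
import random
from collections import Counter

def classify_with_na(x, sample_missing, model, k=1000):
    """Monte Carlo classification under missing features.

    x              : list with None marking missing features
    sample_missing : function index -> random draw for that feature
                     (encodes our assumed distribution over missing values)
    model          : black-box classifier, list -> label
    """
    votes = Counter()
    for _ in range(k):
        filled = [sample_missing(i) if v is None else v
                  for i, v in enumerate(x)]
        votes[model(filled)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / k

# Demo: toy threshold model, missing feature assumed Uniform[0,1].
random.seed(0)
toy_model = lambda x: 1 if sum(x) >= 1.0 else -1
label, confidence = classify_with_na(
    [0.9, None], lambda i: random.random(), toy_model, k=2000)
print(label, confidence)  # label 1, confidence near 0.9
```

Here the missing feature only needs to exceed 0.1 for class `+1`, so roughly 90% of the samples vote that way - the confidence reflects exactly the probability mass your distributional assumption assigns to each outcome.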
Upvotes: 2