lampShadesDrifter

Reputation: 4149

How to capture untrained-on values in h2o python

How do you capture unknown values when making predictions on h2o data frames?

For example, when doing something like:

model.predict(frame_in)

in the h2o python API, a progress bar loads while the model is making predictions, and then a series of lists is printed detailing the unknown labels seen for each of the model's enum-type predictor features, e.g.:

/home/mapr/anaconda2/lib/python2.7/site-packages/h2o/job.py:69: UserWarning:
Test/Validation dataset column 'feature1' has levels not trained on: [, <values>] 

Is there any way to get this set of unknown levels as a python object? Thanks.

When working with h2o MOJOs, there is a Java method called getTotalUnknownCategoricalLevelsSeen(), but I could not find anything like this in the h2o python docs.

Upvotes: 3

Views: 481

Answers (2)

Clem Wang

Reputation: 739

You might consider H2O's GLRM (Generalized Low Rank Model). It can impute missing values.

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glrm.html

Upvotes: 0

lampShadesDrifter

Reputation: 4149

Ended up temporarily capturing the warning output from stderr. Here is the relevant snippet:

import contextlib
import sys

try:
    from StringIO import StringIO  # Python 2, as used in the question
except ImportError:
    from io import StringIO  # Python 3


@contextlib.contextmanager
def stderr_redirect(where):
    """
    Temporarily redirect stderr to a specified file-like object
    see https://stackoverflow.com/a/14197079
    """
    old_stderr = sys.stderr
    sys.stderr = where
    try:
        yield where
    finally:
        # restore whatever stderr was before entering the block
        sys.stderr = old_stderr


# make prediction on data
with stderr_redirect(StringIO()) as new_stderr:
    preds = est.predict(frame_in)

print('Prediction complete')
new_stderr.seek(0)
# capture any warning output
preds_stderr = new_stderr.read()

I then used a regex to keep only the output lines that contain a column name and its list of unseen values, and a second regex to extract just the list, which I strip of whitespace and .split(',') to get a Python list of strings. A regex can also pull the column name from the same line, so the two can be paired in a list of tuples.
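The parsing step could look roughly like the sketch below. It assumes the warning text has exactly the shape quoted in the question ("column '<name>' has levels not trained on: [...]"). The function name parse_unseen_levels and the sample levels (red, blue, XL) are made up for illustration, and empty entries in the list are dropped, which may not be what you want if an empty string is itself a level.

```python
import re

# Assumed message shape, copied from the warning quoted in the question.
UNSEEN_RE = re.compile(r"column '([^']+)' has levels not trained on: \[(.*?)\]")


def parse_unseen_levels(stderr_text):
    """Return [(column_name, [level, ...]), ...] found in captured warning text."""
    pairs = []
    for column, raw_levels in UNSEEN_RE.findall(stderr_text):
        # split the bracketed list on commas and trim surrounding whitespace
        levels = [lvl.strip() for lvl in raw_levels.split(",")]
        # drop empty entries (an assumption; see the note above)
        pairs.append((column, [lvl for lvl in levels if lvl]))
    return pairs


# Hypothetical captured output; the level names are invented for the example.
sample = (
    "job.py:69: UserWarning:\n"
    "Test/Validation dataset column 'feature1' has levels not trained on: [, red, blue]\n"
    "Test/Validation dataset column 'feature2' has levels not trained on: [XL]\n"
)
unseen = parse_unseen_levels(sample)
```

With the snippet above, this would be called as parse_unseen_levels(preds_stderr). Levels that themselves contain a comma or a closing bracket would be split or truncated incorrectly, so treat this as a starting point only.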

Upvotes: 1
