Learner_seeker

Reputation: 544

h2o MOJO predictions differ from h2o.predict for GBM

I am getting different predictions for the same test data set from h2o.predict and h2o.mojo_predict_df. When compared, roughly 50% of records have the same probabilities, but the other 50% differ, and for some the probabilities change drastically, e.g. from 0.88 to 0.55 for the same class.

The model was built with h2o.gbm and exported with h2o.download_mojo(gbm_model,get_genmodel_jar = T).

I have been researching this and found a few posts with similar questions, but no solution:

Reproduce predictions with MOJO file of a H2O GBM model

GLM model: h2o.predict gives very different results depending on number of rows used in the validation data

Why do I get different predictions with MOJO?

The code used so far is below:

# load packages and start the h2o cluster

library(h2o)
library(dplyr)       # mutate_if and %>%
library(data.table)  # fwrite

h2o.init(nthreads=10,min_mem_size = '80g')

# variables 

predictors=c(1:76,78:681)
response=77

# getting datasets ready 

model_ready_df = model_ready_df %>% mutate_if(is.character,as.factor)
train.h2o = as.h2o(model_ready_df)
poc_test = poc_test %>% mutate_if(is.character,as.factor)
test.h2o <- as.h2o(poc_test)


# build model 

gbm_model <- h2o.gbm(x = predictors, y = response, training_frame = train.h2o, seed = 0xDECAF, ntrees = 1000, max_depth = 4,
                     learn_rate = 0.1, stopping_rounds = 50, min_rows = 50, distribution = "bernoulli", ignore_const_cols = FALSE,
                     histogram_type = 'QuantilesGlobal', sample_rate = 0.7, col_sample_rate = 0.7, keep_cross_validation_models = TRUE)


# save model object as a MOJO, together with h2o-genmodel.jar

h2o.download_mojo(gbm_model, get_genmodel_jar = TRUE)

# predict: in-cluster scoring vs. MOJO scoring on the same test data

preds <- as.data.frame(h2o.predict(gbm_model, test.h2o))
preds2 <- h2o.mojo_predict_df(poc_test, 'GBM_model_R_1576045840818_1.zip', genmodel_jar_path = 'h2o-genmodel.jar', verbose = FALSE)

# save 

fwrite(preds,"pred_usual.csv")
fwrite(preds2,"pred_mojo.csv")
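
To quantify how far apart the two sets of predictions are, a comparison along these lines can help. This is only a minimal sketch on made-up numbers: preds_demo and preds2_demo are hypothetical stand-ins for preds and preds2, and both the column name p1 and the tolerance 1e-6 are assumptions rather than values taken from the actual output.

```r
# Hypothetical stand-ins for preds and preds2; the values are invented.
preds_demo  <- data.frame(p1 = c(0.88, 0.10, 0.42))
preds2_demo <- data.frame(p1 = c(0.55, 0.10, 0.42))

# Absolute difference in the class-1 probability, row by row.
abs_diff <- abs(preds_demo$p1 - preds2_demo$p1)

# Rows whose probabilities differ by more than an assumed tolerance.
mismatch <- which(abs_diff > 1e-6)

print(mismatch)
print(max(abs_diff))
```

Inspecting the input rows listed in mismatch is one way to spot which columns are being read differently by the two scoring paths.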


Upvotes: 2

Views: 908

Answers (1)

Learner_seeker

Reputation: 544

h2o.mojo_predict_df writes the data frame to a CSV file and then essentially runs h2o.mojo_predict_csv on it. During this write-and-parse round trip, some variables can be written to the CSV in a format that is not read back correctly, which leads to different results. One example is scientific notation in R: numbers that are displayed as e+10 are written to the CSV in that form, and the formats get mixed up. Set options(scipen=999) to correct for this, then run the MOJO functions. The results should then be the same.
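
The effect of scipen on what lands in a CSV can be seen without h2o at all. The sketch below uses base R's write.csv as a stand-in for the CSV step, which is an assumption about what h2o.mojo_predict_df does internally; the data frame, its column name, and the helper write_and_read are hypothetical and exist only for this illustration.

```r
# Hypothetical toy frame; the column name and values are illustrative only.
df <- data.frame(amount = c(1e10, 25))

# Helper (hypothetical): write a frame to a temporary CSV and return its lines.
write_and_read <- function(d) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  write.csv(d, path, row.names = FALSE)
  readLines(path)
}

old <- options(scipen = 0)     # R's default; keep the previous setting
default_lines <- write_and_read(df)

options(scipen = 999)          # strongly prefer fixed notation
fixed_lines <- write_and_read(df)

options(old)                   # restore the previous setting

print(default_lines)           # the large value is written in e+ notation
print(fixed_lines)             # the large value is written out in full
```

If the MOJO predictions still differ after setting scipen, comparing the temporary CSV against the original data frame column by column may reveal other formatting mismatches.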

Upvotes: 0
