Reputation: 2503
I am using h2o to perform predictive modeling from python. I have loaded some data from a csv using pandas, specifying some column types:
dtype_dict = {'SIT_SSICCOMP':'object',
'SIT_CAPACC':'object',
'PTT_SSIRMPOL':'object',
'PTT_SPTCLVEI':'object',
'cap_pad':'object',
'SIT_SADNS_RESP_PERC':'object',
'SIT_GEOCODE':'object',
'SIT_TIPOFIRMA':'object',
'SIT_TPFRODESI':'object',
'SIT_CITTAACC':'object',
'SIT_INDIRACC':'object',
'SIT_NUMCIVACC':'object'
}
date_cols = ["SIT_SSIDTSIN","SIT_SSIDTDEN","PTT_SPTDTEFF","PTT_SPTDTSCA","SIT_DTANTIFRODE","PTT_DTELABOR"]
columns_to_drop = ['SIT_TPFRODESI','SIT_CITTAACC',
'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
'SIT_LATITACC','cap_pad','SIT_DTANTIFRODE']
comp='mycomp'
file_completo = os.path.join(dataDir,"db4modelrisk_"+comp+".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo,sep=";", encoding='latin1',
header=0,infer_datetime_format =True,na_values=[''], keep_default_na =False,
parse_dates=date_cols,dtype=dtype_dict,nrows=500e3)
db4scoring.drop(labels=columns_to_drop,axis=1,inplace =True)
Then, after I set up a h2o cluster I import it in h2o using db4scoring_h2o = H2OFrame(db4scoring)
and I convert categorical predictors in factor for example:
db4scoring_h2o["SIT_SADTPROV"]=db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"]=db4scoring_h2o["PTT_SPTFRAZ"].asfactor()
When I check data types using db4scoring.dtypes I notice that they are properly set but when I import it in h2o I notice that h2oframe performs some unwanted conversions to enum (eg from float or from int). I wonder if is is a way to specify the variable format in H2OFrame.
Upvotes: 1
Views: 2524
Reputation: 1
# Filter a dictionary to keep elements only whose keys are even
newDict = filterTheDict(dictOfNames, lambda elem : elem[0] % 2 == 0)
print('Filtered Dictionary : ')
print(newDict)`enter code here`
Upvotes: 0
Reputation: 71
Yes, there is. See the H2OFrame doc here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
You just need to use the column_types
argument when you cast.
Here's a short example:
# imports
import h2o
import numpy as np
import pandas as pd
# create small random pandas df
df = pd.DataFrame(np.random.randint(0,10,size=(10, 2)),
columns=list('AB'))
print(df)
# A B
#0 5 0
#1 1 3
#2 4 8
#3 3 9
# ...
# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A':'numeric', 'B':'enum'})
h2o_df.describe() # you should now see the desired data types
# A B
# type int enum
# ...
Upvotes: 3