Reputation: 31
I have a data.frame of 50 rows (subjects) and 572288 columns (variables). When parsing the data.frame into an H2O object, I lose variables and end up with 51 rows and 419431 variables. This does not change if I reduce or increase the number of rows.
library("data.table")
library("h2o")
options("h2o.use.data.table"=T)
h2o.init()
trainset=as.data.frame(matrix(ncol=572288,nrow=50,1))
fwrite(trainset, "train.csv", sep=",")
train=h2o.importFile("train.csv", sep=",")
dim(trainset)
dim(train)
My output is:
> h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 hours 2 minutes
H2O cluster timezone: Europe/Berlin
H2O data parsing timezone: UTC
H2O cluster version: 3.18.0.11
H2O cluster version age: 3 months
H2O cluster name: H2O_started_from_R_chiocchetti_lub856
H2O cluster total nodes: 1
H2O cluster total memory: 9.84 GB
H2O cluster total cores: 24
H2O cluster allowed cores: 20
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.3 (2017-11-30)
> trainset=as.data.frame(matrix(ncol=572288,nrow=50,1))
> fwrite(trainset, "train.csv", sep=",")
>
> train=h2o.importFile("train.csv", sep=",")
|======================================================================|100%
> dim(train)
[1] 51 538177
> dim(trainset)
[1] 50 572288
It seems to me that I am running into some kind of memory issue when reading the lines back from the file. However, I have no idea how to overcome this problem.
The final aim is to train a random forest on these data.
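For reference, the model I plan to fit once the import works would look roughly like the sketch below; "outcome" is only a placeholder for the response column in my real data (the toy example above has no such column), and the settings are not tuned.

# Rough sketch of the intended model, assuming the imported frame "train"
# has the correct dimensions; "outcome" is a placeholder response column.
predictors <- setdiff(colnames(train), "outcome")
rf <- h2o.randomForest(x = predictors,
                       y = "outcome",
                       training_frame = train,
                       ntrees = 100)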
Upvotes: 3
Views: 635
Reputation: 5778
This is likely a bug; I've created a JIRA ticket for it here: https://0xdata.atlassian.net/browse/PUBDEV-5860.
Please feel free to update the ticket if you have a JIRA account.
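Until that is fixed, one possible workaround (untested on your data; the chunk size and file names below are arbitrary) is to write and import the data in narrower column chunks and then h2o.cbind() the pieces back together on the H2O side:

# Hypothetical workaround: export and import the data in column chunks,
# then column-bind the resulting H2OFrames into a single frame.
chunk_size <- 50000                                   # arbitrary number of columns per chunk
col_starts <- seq(1, ncol(trainset), by = chunk_size)

chunks <- lapply(seq_along(col_starts), function(i) {
  cols <- col_starts[i]:min(col_starts[i] + chunk_size - 1, ncol(trainset))
  file <- paste0("train_chunk_", i, ".csv")
  fwrite(trainset[, cols, drop = FALSE], file, sep = ",")
  h2o.importFile(file, sep = ",")
})

train_full <- do.call(h2o.cbind, chunks)
dim(train_full)   # should match dim(trainset) if nothing was lost

If the parser only misbehaves above a certain column count, this should at least let you confirm that the combined frame has the same dimensions as the original data.frame.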
Upvotes: 2