elvikingo

Reputation: 987

How to interpret H2O's time data type?

I have a data frame in R that I am passing to H2O using as.h2o().

dataset.h2o <- as.h2o(dataset,destination_frame = "dataset.h2o")

Running str() on the data frame shows that the week_of_date column is of class Date:

$ primary_account_id : int 31 31 31 31 31 31 31 31 31 31 ...
$ week_of_date : Date, format: "2015-08-31" "2015-09-07" "2015-09-14" "2015-09-21" ...

However, when viewed in H2O Flow, the column appears to have been converted to a type called time, which is displayed like this:

week_of_date time 0 0 0 0 1440943200000.0 1447592400000.0 1444480409625.8884 2013362534.5706

When I bring back the data to R using as.data.frame

returned.dataset <- as.data.frame(dataset.h2o)

it is stored in a format that I am unable to understand and therefore cannot parse back:

$ primary_account_id: int 31 31 698 1060 1060 1060 1060 1060 1060 1133 ...
$ week_of_date :Class 'POSIXct' num [1:194] 1442757600000 1446382800000 1446382800000 1442152800000 1442757600000 ...

Could you please point me in the direction of how I can achieve better interoperability with dates between R and H2O?

Thanks!

Upvotes: 3

Views: 3386

Answers (4)

kangaroo_cliff

Reputation: 6222

Converting to H2O and back is easy if the date-time columns are in the proper format. (Millisecond precision can be lost.) As mentioned in the H2O FAQ:

H2O is set to auto-detect two major date/time formats. The first format is for dates formatted as yyyy-MM-dd. ... The second date format is for dates formatted as dd-MMM-yy.

Times are specified as HH:mm:ss. HH is a two-digit hour and must be a value between 0-23 (for 24-hour time) or 1-12 (for a twelve-hour clock). mm is a two-digit minute value and must be a value between 0-59. ss is a two-digit second value and must be a value between 0-59.

Example

Example Data

dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
x <- paste(dates, times)
df <- data.frame(datetime = strptime(x, "%m/%d/%y %H:%M:%S"))
# > df
#              datetime
# 1 1992-02-27 23:03:20
# 2 1992-02-27 22:29:56
# 3 1992-01-14 01:03:30
# 4 1992-02-28 18:21:03
# 5 1992-02-01 16:56:26

Change the format to one that H2O prefers

# Change format 
df$datetime <- format(df$datetime, format = "%Y-%m-%d %H:%M:%S")

# Convert to an H2O frame
h2o_df <- as.h2o(df)

# Convert back
back_df <- as.data.frame(h2o_df)

back_df
#              datetime
# 1 1992-02-27 23:03:20
# 2 1992-02-27 22:29:56
# 3 1992-01-14 01:03:30
# 4 1992-02-28 18:21:03
# 5 1992-02-01 16:56:26

Upvotes: 0

Abdul Basit Khan

Reputation: 724

Both answers above are great. However, my workaround, which I consider more efficient, is to pass the dataset to H2O without the date column. Once you train a model and make predictions, the predictions have the same number of rows as the original dataset, so you can simply attach the date column to the prediction vector or matrix.

Of course, the predictions in this solution relate to the same period as the original data, as in backtesting.
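A minimal sketch of that idea in base R, under stated assumptions: the toy data, the `spend` column, and the `preds` data frame are all made up for illustration, and `preds` merely stands in for whatever `as.data.frame(h2o.predict(...))` would return. Re-attaching by position only works if the row order is unchanged between upload and prediction.

```r
# Toy data standing in for the asker's dataset (values are made up)
dataset <- data.frame(
  primary_account_id = c(31L, 31L, 698L),
  week_of_date = as.Date(c("2015-08-31", "2015-09-07", "2015-09-14")),
  spend = c(10.5, 12.0, 7.25)  # hypothetical feature column
)

# Keep the dates in R; only the remaining columns would be passed to as.h2o()
dates <- dataset$week_of_date
features <- dataset[, setdiff(names(dataset), "week_of_date"), drop = FALSE]

# Hypothetical stand-in for as.data.frame(h2o.predict(model, as.h2o(features)))
preds <- data.frame(predict = c(0.1, 0.2, 0.3))

# Re-attach by position; guard against a row-count mismatch first
stopifnot(nrow(preds) == length(dates))
result <- data.frame(week_of_date = dates, preds)
str(result)
```

Because the date column never leaves R, it keeps its Date class and no millisecond conversion is needed.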

Upvotes: 0

jmuhlenkamp

Reputation: 2150

Refer to the response by phiver for a more detailed answer, but another simple workaround would be to convert the date columns to character before passing to H2O (if you do not need the column in a date format in H2O). Here is a simple example.

# construct a sample df with a date format column
df <- data.frame(week_of_date = as.Date(c('2015-09-29','2015-10-05')))
str(df$week_of_date)
#  Date[1:2], format: "2015-09-29" "2015-10-05"

# convert the date column to character before passing it to H2O
df$week_of_date <- as.character(df$week_of_date)
str(df$week_of_date)
#  chr [1:2] "2015-09-29" "2015-10-05"

# convert to an H2OFrame, bring it back as an R data.frame, and re-convert to Date
df.hex <- as.h2o(df)
df2 <- as.data.frame(df.hex)
df2$week_of_date <- as.Date(df2$week_of_date)
str(df2$week_of_date)
#  Date[1:2], format: "2015-09-29" "2015-10-05"

Upvotes: 0

phiver

Reputation: 23608

This is a bug in H2O: it returns date-times in milliseconds since the epoch, while R's POSIXct expects seconds. See JIRA issue 3434.

In the meantime, you can recode the date column:

as.Date(structure(returned.dataset$week_of_date/1000, class = c("POSIXct", "POSIXt")))
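The same conversion can be spelled out step by step. This is only a sketch: the millisecond values are made up (chosen to fall on midnight UTC), and the `tz = "UTC"` arguments are an assumption. As far as I know, as.Date() on a POSIXct value uses UTC unless told otherwise, so timestamps uploaded from a session in another time zone may land on the neighbouring day until `tz` is set to match.

```r
# Epoch milliseconds, as they come back from H2O (illustrative values)
ms <- c(1440979200000, 1441584000000)

# Milliseconds -> seconds, then build a POSIXct with an explicit origin and time zone
dt <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")

# Drop the time-of-day part, using the same time zone
d <- as.Date(dt, tz = "UTC")
print(d)
# [1] "2015-08-31" "2015-09-07"
```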

Upvotes: 2
