Reputation: 881
I have a data frame in R that I have imported from a CSV. The "time" format in the csv is "%Y-%m-%d %H:%M:%S" like so:
> head(btc_data)
time btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663
When I run str(btc_data), the time column comes back as a factor, so I converted it to a datetime using the lubridate package as follows:
btc_data$time <- ymd_hms(as.character(btc_data$time))
The problem is that the rows collected at midnight (5 rows) fail to parse and return NA values. In the original data the time part is missing from these rows, so 2017-08-29 00:00:00 is listed simply as 2017-08-29:
724 2017-08-28 23:59:00 4,439.3313
725 NA 4,439.6588
726 2017-08-29 00:01:00 4,440.3050
Moreover, the second data frame is organized differently:
> str(eth_data)
'data.frame': 1081 obs. of 2 variables:
$ time : Factor w/ 1081 levels "8/28/17 16:19",..: 1 2 3 4 5 6 7 8 9 10 ...
$ eth_price: num 344 344 344 344 343 ...
When I try:
> eth_data$time <- mdy_hms(as.character(eth_data$time))
I get the following warning, and every value comes back as NA:
Warning message: All formats failed to parse. No formats found.
EDIT: I have isolated the code issue that is causing the problem:
> btc_data[721:726,]
time btc_price
721 2017-08-28 23:57:00 4,439.8163
722 2017-08-28 23:58:00 4,440.2363
723 2017-08-28 23:58:00 4,440.2363
724 2017-08-28 23:59:00 4,439.3313
725 2017-08-29 4,439.6588
726 2017-08-29 00:01:00 4,440.3050
So, each time the clock strikes midnight, the time part of the timestamp is not recorded. The CSV is created from a data stream and is constantly growing, so this issue will recur with each new day unless I can find a workaround. Any suggestions?
Upvotes: 1
Views: 1695
Reputation: 4224
If the '00:00:00' is missing from the original data to begin with, you can use grep() to find those rows, then paste() '00:00:00' onto them before calling ymd_hms() or mdy_hm().
First case, where date/time format is 'YYYY-mm-dd HH:MM:SS':
library(data.table) #provides fread()
library(lubridate) #provides ymd_hms() and mdy_hm()
#Before
test <- fread("time, btc_price
2017-08-28 23:57:00, 4439.8163
2017-08-28 23:58:00, 4440.2363
2017-08-28 23:58:00, 4440.2363
2017-08-28 23:59:00, 4439.3313
2017-08-29 , 4439.6588
2017-08-29 00:01:00, 4440.3050")
#Find the rows that hold only a date (no time part) and append midnight
date_only <- grep("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", test$time)
test$time[date_only] <- paste(test$time[date_only], "00:00:00")
#After
print(test)
time btc_price
1: 2017-08-28 23:57:00 4439.816
2: 2017-08-28 23:58:00 4440.236
3: 2017-08-28 23:58:00 4440.236
4: 2017-08-28 23:59:00 4439.331
5: 2017-08-29 00:00:00 4439.659
6: 2017-08-29 00:01:00 4440.305
#Now you can use ymd_hms(as.character(test$time)) as usual.
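As an alternative to the find/replace step, lubridate's parsers accept a `truncated` argument that tolerates missing trailing components. A minimal sketch, assuming the lubridate package is installed and that the date-only rows look like the ones in the question (the vector `x` below is made-up sample input, not the real file):

```r
library(lubridate)

# Sample timestamps; the midnight row has no time part, as in the question.
x <- c("2017-08-28 23:59:00", "2017-08-29", "2017-08-29 00:01:00")

# truncated = 3 allows up to three trailing components (hours, minutes,
# seconds) to be missing; missing parts are treated as zero.
parsed <- ymd_hms(x, truncated = 3)
print(parsed)
```

This skips the regular expression entirely, which may be convenient when the file keeps growing and is re-read every day.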
Second case, where date/time format is 'm/dd/yy HH:MM':
#Step 1 is to find/replace:
test <- fread("time, btc_price
8/28/17 23:57, 4439.8163
8/28/17 23:57, 4440.2363
8/28/17 23:57, 4440.2363
8/28/17 23:57, 4439.3313
8/28/17 , 4439.6588
8/29/17 00:01, 4440.3050")
#Find the date-only rows (one- or two-digit month and day) and append midnight
date_only <- grep("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", test$time)
test$time[date_only] <- paste(test$time[date_only], "00:00")
print(test)
time btc_price
1: 8/28/17 23:57 4439.816
2: 8/28/17 23:57 4440.236
3: 8/28/17 23:57 4440.236
4: 8/28/17 23:57 4439.331
5: 8/28/17 00:00 4439.659
6: 8/29/17 00:01 4440.305
#Step 2 is to adjust your mdy_hms() command; the data has no seconds, so leave off the 's' and use mdy_hm():
#Ex. before:
mdy_hms(as.character("8/28/17 16:19"))
[1] NA
Warning message:
All formats failed to parse. No formats found.
#After
test <- c("8/28/17 16:19","8/28/17 00:00")
mdy_hm(as.character(test))
[1] "2017-08-28 16:19:00 UTC" "2017-08-28 00:00:00 UTC"
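Because the CSV keeps growing, it may help to wrap both steps (pad the date-only rows, then parse) in one function that can be re-run after each import. The sketch below uses only base R; `parse_padded` is a hypothetical helper name, and it assumes a value with no space in it is a bare date:

```r
# Hypothetical helper: append a midnight time to date-only strings, then parse.
# fmt is a strptime() format; pad is the time text to append (e.g. "00:00").
parse_padded <- function(x, fmt, pad) {
  x <- trimws(as.character(x))                  # also converts factors
  # Assumption: a non-missing value without a space has no time part.
  date_only <- !is.na(x) & !grepl(" ", x, fixed = TRUE)
  x[date_only] <- paste(x[date_only], pad)
  as.POSIXct(x, format = fmt, tz = "UTC")
}

# First format: YYYY-mm-dd HH:MM:SS
btc_time <- parse_padded(c("2017-08-28 23:59:00", "2017-08-29"),
                         fmt = "%Y-%m-%d %H:%M:%S", pad = "00:00:00")

# Second format: m/d/yy HH:MM (no seconds)
eth_time <- parse_padded(c("8/28/17 16:19", "8/29/17"),
                         fmt = "%m/%d/%y %H:%M", pad = "00:00")

print(btc_time)
print(eth_time)
```

The time zone is set to UTC here only to match the lubridate output above; substitute whatever zone the data stream actually uses.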
In general, it's also good practice to store numbers without thousands separators in R, so 4,439.3313 should be 4439.3313. Otherwise the column is typically read as text rather than as a number, and in an unquoted CSV the comma can be mistaken for a column separator.
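If the prices have already been imported with the commas in place, one way to clean them up afterwards is to strip the commas and convert to numeric. A small sketch with made-up values in the same shape as the question's btc_price column:

```r
# Prices as they would look when read in as text (or as.character() of a factor).
price_chr <- c("4,389.6113", "4,389.0850", "4,440.3050")

# Drop the thousands separator, then convert to numeric.
price_num <- as.numeric(gsub(",", "", price_chr, fixed = TRUE))
print(price_num)
```

On the real data frame this would be applied to as.character(btc_data$btc_price).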
Upvotes: 4