Reputation: 49
I have a dataset in R where a time variable has been imported as text, because importing it without specifying text converts many observations to NA. However, I've discovered that the time variable has inconsistent formatting. Some rows have numeric values (for example, 0.24962962962962965), while others have the HH:MM:SS format (for example, 07:19:52). My goal is to convert this variable into a consistent HH:MM:SS time format in R.
How can I address this situation and convert the time variable to a consistent HH:MM:SS format for the entire dataset?
I've tried some approaches using mathematical operations and conversion functions, but I'm unsure how to handle both numeric values and time formats in a single column.
I've attached simplified data for this case:
datos_texto <- c("0.24962962962962965", "07:19:52", "0.123456789", "10:45:30", "0.567891234")
I would greatly appreciate any advice or code examples that could help me solve this issue and obtain a time variable in the desired format.
Thank you in advance for your assistance!
Reputation: 49
I wanted to share the solution I finally found to address the issue I raised in this question. I've been working on it and I appreciate everyone who tried to help. Your contributions were invaluable and guided me in the right direction.
After researching and testing different approaches, I managed to solve the problem. The key was to handle the different formats in the datos_texto vector correctly. I used the case_when() function along with the chron package, with the help of regular expressions, to tell the numeric values apart from the time strings.
Here is the code that eventually worked for me:
# Install (if needed) and load the required libraries
# install.packages("chron")
library(tidyverse)
library(chron)
df <- data.frame(datos_texto = c("0.24962962962962965", "07:19:52", "0.123456789", "10:45:30", "0.567891234"))
df <- df %>%
  # Transform the 'datos_texto' column using case_when()
  mutate(
    datos_texto = case_when(
      # If the value is a decimal number with a single leading digit (e.g. 0.2496...)
      grepl('^\\d\\.\\d+$', datos_texto) ~ chron::times(as.double(datos_texto)),
      # Otherwise, treat it as an HH:MM:SS string
      TRUE ~ chron::times(datos_texto)
    )
  )
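A quick way to see what the regular expression in the code above treats as a decimal value is to run it on a few strings with base R alone. This is only a sketch, and the test vector is made up for illustration.

```r
# Same pattern as in the case_when() call: one digit, a dot, then one or more digits.
pattern <- "^\\d\\.\\d+$"

# Hypothetical inputs chosen to show what does and does not match.
candidates <- c("0.24962962962962965", "07:19:52", "12.5", ".5", "1")

is_decimal <- grepl(pattern, candidates)
is_decimal
# [1]  TRUE FALSE FALSE FALSE FALSE
```

Because the pattern allows only one digit before the decimal point, it fits day fractions between 0 and 1 but would skip a value such as 12.5.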
I want to express my gratitude once again to everyone who took the time to respond to my question and provide suggestions. Your support was invaluable in my troubleshooting process!
Reputation: 160417
Two suggestions here: convert all decimal days to HH:MM:SS.SSS, or convert all timestamps to decimal days.
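We can use this function, num2time, to convert the decimal values to times, assuming each decimal is in "decimal days" (so 0.25 is a quarter of the way through the day, or 06:00:00). The function itself takes seconds, so the decimal days are multiplied by 86400 before it is called.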
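To make the "decimal days" assumption concrete, here is a minimal base-R sketch of the arithmetic for a single value. The variable names are only illustrative and are not part of the function that follows.

```r
# Assumption: the number is a fraction of a 24-hour day (0.25 = 06:00:00).
frac_day <- 0.25
secs <- frac_day * 86400          # 86400 seconds per day, so 21600 here
hr   <- secs %/% 3600             # whole hours
min  <- (secs %% 3600) %/% 60     # whole minutes left after the hours
sec  <- secs %% 60                # seconds left after the minutes
one_time <- sprintf("%02d:%02d:%02d",
                    as.integer(hr), as.integer(min), as.integer(sec))
one_time
# [1] "06:00:00"
```

The num2time function below applies the same idea to whole vectors and also keeps fractional seconds.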
num2time <- function(x, digits.secs = getOption("digits.secs", 3)) {
  hr <- as.integer(x %/% 3600)
  min <- as.integer((x - 3600*hr) %/% 60)
  sec <- (x - 3600*hr - 60*min)
  if (anyNA(digits.secs)) {
    # a mostly-arbitrary determination of significant digits,
    # motivated by @Roland https://stackoverflow.com/a/27767973
    for (digits.secs in 1:6) {
      if (any(abs(signif(sec, digits.secs) - sec) > (10^(-3 - digits.secs)))) next
      digits.secs <- digits.secs - 1L
      break
    }
  }
  sec <- sprintf(paste0("%02.", digits.secs[[1]], "f"), sec)
  sec <- paste0(ifelse(grepl("^[0-9]\\.", sec), "0", ""), sec)
  out <- sprintf("%02i:%02i:%s", hr, min, sec)
  out[is.na(x)] <- NA_character_
  out
}
With this,
nocolon <- !grepl(":", datos_texto)
datos_texto[nocolon] <- num2time(as.numeric(datos_texto[nocolon]) * 86400)
datos_texto
# [1] "05:59:28.000" "07:19:52" "02:57:46.667" "10:45:30" "13:37:45.803"
This can then be handled the same, whether retaining as a character string or converting into a "timestamp" (without date component) with something like
lubridate::hms(datos_texto)
# [1] "5H 59M 28S" "7H 19M 52S" "2H 57M 46.667S" "10H 45M 30S" "13H 37M 45.803S"
hms::parse_hms(datos_texto)
# 05:59:28.000
# 07:19:52.000
# 02:57:46.667
# 10:45:30.000
# 13:37:45.803
str(hms::parse_hms(datos_texto))
# 'hms' num [1:5] 05:59:28.000 07:19:52.000 02:57:46.667 10:45:30.000 ...
# - attr(*, "units")= chr "secs"
since in that format, numerical operations (plus, minus, difference, etc) are clearly defined.
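As a small illustration of that point, the sketch below computes the elapsed time between two of the converted values. It assumes the hms package is installed, and the variable names are only illustrative.

```r
# Two of the converted values, parsed as hms (stored internally as seconds).
start <- hms::parse_hms("05:59:28")
end   <- hms::parse_hms("07:19:52")

# Work on the underlying seconds, then turn the result back into a time.
elapsed_secs <- as.numeric(end) - as.numeric(start)
elapsed_secs
# [1] 4824

hms::as_hms(elapsed_secs)
# 01:20:24
```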
Another option is to convert the time-like fields to numeric.
time2num <- function(x) {
  vapply(strsplit(x, ':'), function(y) sum(as.numeric(y) * 60^((length(y)-1):0)),
         numeric(1), USE.NAMES=FALSE)
}
With this,
out <- numeric(length(datos_texto))
nocolon <- !grepl(":", datos_texto)
out[nocolon] <- as.numeric(datos_texto[nocolon])
out[!nocolon] <- time2num(datos_texto[!nocolon]) / 86400
out
# [1] 0.2496296 0.3054630 0.1234568 0.4482639 0.5678912
and now out is numeric, as decimal days for all of datos_texto.
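The steps above can also be wrapped into a single function. The following is a minimal base-R sketch under stated assumptions: to_decimal_days is a hypothetical helper name, every colon-containing entry is assumed to be exactly HH:MM:SS, and every other entry is assumed to already be in decimal days.

```r
# Hypothetical helper (not from the code above): returns decimal days for a
# character vector that mixes "0.2496..." and "HH:MM:SS" entries.
to_decimal_days <- function(x) {
  out <- rep(NA_real_, length(x))
  has_colon <- grepl(":", x, fixed = TRUE)
  # Plain numbers are assumed to be decimal days already.
  out[!has_colon] <- as.numeric(x[!has_colon])
  # Clock times: weight hours, minutes, seconds by 3600, 60, 1,
  # then divide by the number of seconds in a day.
  parts <- strsplit(x[has_colon], ":", fixed = TRUE)
  secs <- vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
  out[has_colon] <- secs / 86400
  out
}

mixed <- c("0.24962962962962965", "07:19:52", "0.123456789", "10:45:30", "0.567891234")
res <- to_decimal_days(mixed)
res
```

Building the result in a fresh numeric vector, rather than writing back into the character input, keeps the output numeric throughout.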
Incidentally, one might be tempted to do datos_texto[nocolon] <- as.numeric(datos_texto[nocolon]). Realize that datos_texto, unless all of it is replaced at once, will remain character, so the results of as.numeric are lost. It is definitely possible to convert the :-containing strings with time2num in-place, but they will be converted back to strings, so you'll end up with:
datos_texto[!nocolon] <- time2num(datos_texto[!nocolon]) / 86400
datos_texto
# [1] "0.24962962962963" "0.305462962962963" "0.123456793981481" "0.448263888888889" "0.567891238425926"
This generally comes up with the same result, but time2num converts to a floating-point numeric, and then replacing it into subsets of datos_texto results in it being converted to string representations of the floating-point numbers. This is easily converted again as
as.numeric(datos_texto)
# [1] 0.2496296 0.3054630 0.1234568 0.4482639 0.5678912
but converting to number, then string, then number is inefficient (and R is relatively inefficient with large amounts of strings: search for "R global string pool", see "Object size for characters in R - How does R global string pool work?" and https://adv-r.hadley.nz/names-values.html, and put your learning cap on). This also works, but I recommend and prefer the use of a numeric vector for this.
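The coercion described above is easy to reproduce on a tiny vector. This is a minimal base-R sketch with made-up values and illustrative variable names.

```r
x <- c("0.25", "07:19:52")

# Assigning a number into part of a character vector coerces it back to a string.
x[1] <- as.numeric(x[1])
class(x)
# [1] "character"

# Writing into a separate numeric vector keeps the value numeric.
num <- rep(NA_real_, length(x))
num[1] <- as.numeric(x[1])
class(num)
# [1] "numeric"
```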