Reputation: 49

"Handling Format Inconsistencies in Time Variable When Importing Data in R"

I have a dataset in R where a time variable has been imported as text, because importing it without forcing text converts many observations to NA. However, I've discovered that the time variable has inconsistent formatting. Some rows hold numeric values (for example, 0.24962962962962965), while others use the HH:MM:SS format (for example, 07:19:52). My goal is to convert this variable into a consistent HH:MM:SS time format in R.

How can I address this situation and convert the time variable to a consistent HH:MM:SS format for the entire dataset?

I've tried some approaches using mathematical operations and conversion functions, but I'm unsure how to handle both numeric values and time formats in a single column.

I've attached simplified data for this case:

datos_texto <- c("0.24962962962962965", "07:19:52", "0.123456789", "10:45:30", "0.567891234")

I would greatly appreciate any advice or code examples that could help me solve this issue and obtain a time variable in the desired format.

Thank you in advance for your assistance!

Upvotes: 0

Answers (2)

Darwin Galvis

Reputation: 49

I wanted to share the solution I finally found to address the issue I raised in this question. I've been working on it and I appreciate everyone who tried to help. Your contributions were invaluable and guided me in the right direction.

After researching and testing different approaches, I managed to solve the problem. The key was to handle the different formats in the datos_texto vector correctly. I used the case_when() function together with the chron package, with the help of a regular expression to tell the numeric values apart from the time strings.

Here is the code that eventually worked for me:

# Load and install the required libraries
# install.packages("chron")

library(tidyverse)
library(chron)

df <- data.frame(datos_texto = c("0.24962962962962965", "07:19:52", "0.123456789", "10:45:30", "0.567891234"))

df <- df %>% 
  # Transform the 'datos_texto' column using case_when()
  mutate(
    datos_texto = case_when(
      # If the pattern matches a numeric value with a decimal point
      grepl('^\\d\\.\\d+$', datos_texto) ~ chron::times(as.double(datos_texto)),
      # If it doesn't match the numeric pattern
      TRUE ~ chron::times(datos_texto)
    )
  )
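
Because the TRUE branch above sends every non-numeric value to chron::times(), it can help to first check that each entry really has one of the two expected shapes. The following is a minimal base-R sketch; the two regular expressions and the names is_decimal, is_hms and formato are illustrative assumptions based on the sample data, not part of the original solution.

```r
datos_texto <- c("0.24962962962962965", "07:19:52", "0.123456789",
                 "10:45:30", "0.567891234")

# Assumed shape 1: a plain decimal number (fraction of a day)
is_decimal <- grepl("^[0-9]*\\.?[0-9]+$", datos_texto)
# Assumed shape 2: HH:MM:SS, optionally with fractional seconds
is_hms <- grepl("^[0-9]{1,2}:[0-9]{2}:[0-9]{2}(\\.[0-9]+)?$", datos_texto)

# Label each entry; anything matching neither pattern is flagged as "other"
formato <- ifelse(is_decimal, "decimal", ifelse(is_hms, "hms", "other"))
table(formato)

# Entries that would need manual inspection before conversion
datos_texto[formato == "other"]
```

With the sample vector no entry falls into "other"; on real data, that subset shows which values need attention before any conversion is attempted.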

I want to express my gratitude once again to everyone who took the time to respond to my question and provide suggestions. Your support was invaluable in my troubleshooting process!

Upvotes: 0

r2evans

Reputation: 160417

Two suggestions here: convert all decimal-days to HH:MM:SS.SSS; or convert all timestamps to decimal days.

Convert all to HH:MM:SS

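We can use this function num2time to format a number of seconds as a time string. The decimal values are assumed to be "decimal days" (so 0.25 is a quarter of the way through the day, or 06:00:00), which is why they are multiplied by 86400 (seconds per day) before being passed in.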

num2time <- function(x, digits.secs = getOption("digits.secs", 3)) {
  hr <- as.integer(x %/% 3600)
  min <- as.integer((x - 3600*hr) %/% 60)
  sec <- (x - 3600*hr - 60*min)
  if (anyNA(digits.secs)) {
    # a mostly-arbitrary determination of significant digits,
    # motivated by @Roland https://stackoverflow.com/a/27767973
    for (digits.secs in 1:6) {
      if (any(abs(signif(sec, digits.secs) - sec) > (10^(-3 - digits.secs)))) next
      digits.secs <- digits.secs - 1L
      break
    }
  }
  sec <- sprintf(paste0("%02.", digits.secs[[1]], "f"), sec)
  sec <- paste0(ifelse(grepl("^[0-9]\\.", sec), "0", ""), sec)
  out <- sprintf("%02i:%02i:%s", hr, min, sec)
  out[is.na(x)] <- NA_character_
  out
}

With this,

nocolon <- !grepl(":", datos_texto)
datos_texto[nocolon] <- num2time(as.numeric(datos_texto[nocolon]) * 86400)
datos_texto
# [1] "05:59:28.000" "07:19:52"     "02:57:46.667" "10:45:30"     "13:37:45.803"
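
As a sanity check on the arithmetic, the decimal-day conversion can be worked by hand. This is a small sketch, assuming the "decimal days" interpretation; the names days, secs, hh, mm, ss and res are illustrative, and the inputs are chosen to be exact binary fractions so that no rounding is involved.

```r
days <- c(0.25, 0.5, 0.75)       # fractions of a day
secs <- days * 86400             # 86400 seconds per day: 21600 43200 64800

hh <- secs %/% 3600              # whole hours
mm <- (secs %% 3600) %/% 60      # whole minutes left after the hours
ss <- secs %% 60                 # seconds left after the minutes

res <- sprintf("%02d:%02d:%02d", as.integer(hh), as.integer(mm), as.integer(ss))
res
# [1] "06:00:00" "12:00:00" "18:00:00"
```

Unlike num2time, this sketch drops fractional seconds, so it is only meant to illustrate the integer-division steps.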

The vector is now uniformly formatted, so it can either be kept as a character vector or converted into a "timestamp" (without a date component) with something like

lubridate::hms(datos_texto)
# [1] "5H 59M 28S"      "7H 19M 52S"      "2H 57M 46.667S"  "10H 45M 30S"     "13H 37M 45.803S"

hms::parse_hms(datos_texto)
# 05:59:28.000
# 07:19:52.000
# 02:57:46.667
# 10:45:30.000
# 13:37:45.803
str(hms::parse_hms(datos_texto))
#  'hms' num [1:5] 05:59:28.000 07:19:52.000 02:57:46.667 10:45:30.000 ...
#  - attr(*, "units")= chr "secs"

since in that format, numerical operations (plus, minus, difference, etc) are clearly defined.

Convert all to decimal-days

Another option is to convert the time-like fields to numeric.

time2num <- function(x) {
  vapply(strsplit(x, ':'), function(y) sum(as.numeric(y) * 60^((length(y)-1):0)),
         numeric(1), USE.NAMES=FALSE)
}

With this,

out <- numeric(length(datos_texto))
nocolon <- !grepl(":", datos_texto)
out[nocolon] <- as.numeric(datos_texto[nocolon])
out[!nocolon] <- time2num(datos_texto[!nocolon]) / 86400
out
# [1] 0.2496296 0.3054630 0.1234568 0.4482639 0.5678912

and now out is numeric as decimal days for all of datos_texto.
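
To make the arithmetic inside time2num concrete, here is one value worked step by step. This is only an illustrative sketch; the names partes, pesos and segundos are not part of the function above.

```r
# Split "07:19:52" into its numeric fields
partes <- as.numeric(strsplit("07:19:52", ":")[[1]])   # 7 19 52

# Base-60 place values: 60^2, 60^1, 60^0 for a three-field string
pesos <- 60^((length(partes) - 1):0)                   # 3600 60 1

# Seconds since midnight: 7*3600 + 19*60 + 52
segundos <- sum(partes * pesos)                        # 26392

# Fraction of a day
segundos / 86400
```

Because the place values are derived from the number of fields, a two-field string such as "19:52" is read as minutes and seconds (weights 60 and 1).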

Incidentally, one might be tempted to do datos_texto[nocolon] <- as.numeric(datos_texto[nocolon]). Realize that datos_texto remains a character vector unless the whole vector is replaced at once, so the results of as.numeric are immediately coerced back to strings and the conversion is lost. It is certainly possible to convert the :-containing strings with time2num in place, but the results are likewise stored as strings, so you'll end up with:

datos_texto[!nocolon] <- time2num(datos_texto[!nocolon]) / 86400
datos_texto
# [1] "0.24962962962963"  "0.305462962962963" "0.123456793981481" "0.448263888888889" "0.567891238425926"

This generally arrives at the same result, but time2num returns floating-point numbers, and assigning them into a subset of datos_texto stores their string representations instead. These are easily converted again with

as.numeric(datos_texto)
# [1] 0.2496296 0.3054630 0.1234568 0.4482639 0.5678912

but converting to number, then string, then number is inefficient. R is also relatively inefficient with large numbers of strings: search for "R global string pool", see the question "Object size for characters in R - How does R global string pool work?" and https://adv-r.hadley.nz/names-values.html, and put your learning-cap on. The in-place approach works, but I recommend and prefer a numeric vector for this.
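
The coercion described above can be seen on a tiny vector. This is a minimal sketch; x and out here are throwaway illustrative names.

```r
x <- c("0.25", "07:19:52")

# Assigning a number into part of a character vector coerces it to a string
x[1] <- as.numeric(x[1]) * 2
x[1]
# [1] "0.5"
class(x)
# [1] "character"

# Writing into a preallocated numeric vector keeps the values numeric
out <- numeric(length(x))
out[1] <- 0.5
class(out)
# [1] "numeric"
```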

Upvotes: 0
