RabidMutant
RabidMutant

Reputation: 611

In R (or RStudio) I want to convert a field with units to a number, but I get erroneous warnings

Using R, I have data that looks like:

A,   B
=  =====
x,"  120"
y," 2300"
z,"1.2 M"
x," 4500"
x," 42 M"

This was constructed using:

A <- c("x","y","z","x","x")
B <- c("  120", " 2300", "1.2 M", " 4500", " 42 M")
data <- data.frame(A, B)

I want to convert the second column to a valid number (the 'M' signifies millions). What I have is:

df <- data %>%
    mutate( B = ifelse( grepl("^ *[0-9]+$", .$B), 
                        as.numeric(.$B),
                        1000000 * as.numeric(sub(" *([0-9]+) M", "\\1", .$B))))

ie. it looks for a field with an 'M', and if found does as.numeric on the sustricg and multiplies by 1000000.

This works fine, EXCEPT I get a warning:

Warning messages:
1: Problem with `mutate()` input `B`.
i NAs introduced by coercion
i Input `B` is `ifelse(...)`. 
2: In ifelse(grepl("^ *[0-9]+$", .$B), as.numeric(.$B), 1e+06 *  :
  NAs introduced by coercion

But when I inspect the data using df %>% filter(is.na(B)), there are no NAs

I don't want to disable the warning entirely, but I do want to avoid it when it's "invalid".

Any suggestions? Ami I missing something?

Upvotes: 3

Views: 1298

Answers (2)

r2evans
r2evans

Reputation: 160952

I have a function, not completely tested, that's a companion to another I wrote (that is based loosely off of utils:::format.object_size).

unKMG <- function(s, standard = "SI") {
  known_bases <- c(legacy = 1024, IEC = 1024, SI = 1000)
  known_units <- list(SI = c("", "k", "M", "G", "T", "P", 
      "E", "Z", "Y"), IEC = c("", "Ki", "Mi", "Gi", 
      "Ti", "Pi", "Ei", "Zi", "Yi"), legacy = c("", "K", 
      "M", "G", "T", "P"))
  standard <- match.arg(standard, c("auto", names(known_bases)))
  powers <- known_bases[[standard]] ^
    (setNames(seq_along(known_units[[standard]]), known_units[[standard]])-1)
  sapply(strsplit(trimws(s), "[[:space:]]+"),
         function(z) {
           nums <- suppressWarnings(as.numeric(z))
           prod(nums[!is.na(nums)], powers[names(powers) %in% z[is.na(nums)]])
         })
}

unKMG(dat$B)
# [1]      120     2300  1200000     4500 42000000

But if you want a one-shot approach, try

as.numeric(gsub(".*\\b(-?[0-9]+\\.?[0-9]*)\\b.*", "\\1", dat$B)) * 
  ifelse(grepl("M", dat$B), 1000000, 1)
# [1]      120     2300 12000000     4500 42000000

dat %>%
  mutate(
    B = as.numeric(gsub(".*\\b(-?[0-9]+\\.?[0-9]*)\\b.*", "\\1", B)) * 
          ifelse(grepl("M", B), 1000000, 1)
  )

Explanation of the regex:

  • .* anything (including nothing)
  • \\b a word-boundary, not consumed (included); this ensures we don't start or end a number in the middle of a larger number, for instance
  • (...) group, is referenced in the second argument as \\1
  • -? 0 or 1 of the negative sign
  • [0-9]+ 1 or more digits
  • \\.? 0 or 1 of the decimal point (replace with , or [,.] for different locales)
  • [0-9]* 0 or more digits

(I named my data dat, if you hadn't noticed ... I've been bitten too many times by errors about invalid 'type' (closure) of argument when I misplaced my data that now I tend to use other names that do not collide with base R functions.)

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 389275

The reason why you get the warning is because ifelse/if_else/case_when pass the entire vector in both yes and no part. The output to return is later decided based on the condition.

So we think that we are applying as.numeric(B) to only those numbers that satisfy the condition but in reality it is same as applying as.numeric(data$B) which gives the same warning that few values in B cannot be converted to numeric. So you need to use ifelse in a way that doesn't generate a warning at all for any value (which is shown by @r2evans already) or apply the manipulation separately without ifelse.

data$C <- NA_real_
inds <- grepl("^ *[0-9]+$", data$B)
data$C[inds] <- as.numeric(B[inds])
data$C[!inds] <- 1000000 * as.numeric(sub(" *([0-9]+) M", "\\1", data$B[!inds]))

Upvotes: 3

Related Questions