Reputation: 611
Using R, I have data that looks like:
A, B
= =====
x," 120"
y," 2300"
z,"1.2 M"
x," 4500"
x," 42 M"
This was constructed using:
A <- c("x","y","z","x","x")
B <- c(" 120", " 2300", "1.2 M", " 4500", " 42 M")
data <- data.frame(A, B)
I want to convert the second column to a valid number (the 'M' signifies millions). What I have is:
df <- data %>%
mutate( B = ifelse( grepl("^ *[0-9]+$", .$B),
as.numeric(.$B),
1000000 * as.numeric(sub(" *([0-9]+) M", "\\1", .$B))))
ie. it looks for a field with an 'M', and if found does as.numeric on the sustricg and multiplies by 1000000.
This works fine, EXCEPT I get a warning:
Warning messages:
1: Problem with `mutate()` input `B`.
i NAs introduced by coercion
i Input `B` is `ifelse(...)`.
2: In ifelse(grepl("^ *[0-9]+$", .$B), as.numeric(.$B), 1e+06 * :
NAs introduced by coercion
But when I inspect the data using df %>% filter(is.na(B))
, there are no NAs
I don't want to disable the warning entirely, but I do want to avoid it when it's "invalid".
Any suggestions? Ami I missing something?
Upvotes: 3
Views: 1298
Reputation: 160952
I have a function, not completely tested, that's a companion to another I wrote (that is based loosely off of utils:::format.object_size
).
unKMG <- function(s, standard = "SI") {
known_bases <- c(legacy = 1024, IEC = 1024, SI = 1000)
known_units <- list(SI = c("", "k", "M", "G", "T", "P",
"E", "Z", "Y"), IEC = c("", "Ki", "Mi", "Gi",
"Ti", "Pi", "Ei", "Zi", "Yi"), legacy = c("", "K",
"M", "G", "T", "P"))
standard <- match.arg(standard, c("auto", names(known_bases)))
powers <- known_bases[[standard]] ^
(setNames(seq_along(known_units[[standard]]), known_units[[standard]])-1)
sapply(strsplit(trimws(s), "[[:space:]]+"),
function(z) {
nums <- suppressWarnings(as.numeric(z))
prod(nums[!is.na(nums)], powers[names(powers) %in% z[is.na(nums)]])
})
}
unKMG(dat$B)
# [1] 120 2300 1200000 4500 42000000
But if you want a one-shot approach, try
as.numeric(gsub(".*\\b(-?[0-9]+\\.?[0-9]*)\\b.*", "\\1", dat$B)) *
ifelse(grepl("M", dat$B), 1000000, 1)
# [1] 120 2300 12000000 4500 42000000
dat %>%
mutate(
B = as.numeric(gsub(".*\\b(-?[0-9]+\\.?[0-9]*)\\b.*", "\\1", B)) *
ifelse(grepl("M", B), 1000000, 1)
)
Explanation of the regex:
.*
anything (including nothing)\\b
a word-boundary, not consumed (included); this ensures we don't start or end a number in the middle of a larger number, for instance(...)
group, is referenced in the second argument as \\1
-?
0 or 1 of the negative sign[0-9]+
1 or more digits\\.?
0 or 1 of the decimal point (replace with ,
or [,.]
for different locales)[0-9]*
0 or more digits(I named my data dat
, if you hadn't noticed ... I've been bitten too many times by errors about invalid 'type' (closure) of argument
when I misplaced my data
that now I tend to use other names that do not collide with base R functions.)
Upvotes: 2
Reputation: 389275
The reason why you get the warning is because ifelse
/if_else
/case_when
pass the entire vector in both yes
and no
part. The output to return is later decided based on the condition.
So we think that we are applying as.numeric(B)
to only those numbers that satisfy the condition but in reality it is same as applying as.numeric(data$B)
which gives the same warning that few values in B
cannot be converted to numeric. So you need to use ifelse
in a way that doesn't generate a warning at all for any value (which is shown by @r2evans already) or apply the manipulation separately without ifelse
.
data$C <- NA_real_
inds <- grepl("^ *[0-9]+$", data$B)
data$C[inds] <- as.numeric(B[inds])
data$C[!inds] <- 1000000 * as.numeric(sub(" *([0-9]+) M", "\\1", data$B[!inds]))
Upvotes: 3