Reputation: 53
I have a dataset that abbreviates numerical values in a column. For example, 12M mean 12 million, 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?
I've though about using gsub to convert M to 000,000 etc but that does not take into account the decimals (1.5M would then be 1.5000000).
Upvotes: 5
Views: 1508
Reputation: 593
All other answers didn't work nicely with NA for me (or produce warnings, which is also not nice).
Here is my solution, which reuses some bits from other solutions. (also posted at R data.table speed up SI / Metric Conversion)
library(stringr)
si2num <- function(x)
{
conv <- paste0("e", c(seq(-24 ,-3, by=3), -2, -1, seq(3, 24, by=3),3))
names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","K","M","G","T","P","E","Z","Y","k")
xout <- str_replace_all(x, conv)
xout <- as.numeric(xout)
return(xout)
}
x <- c(NA,"10", "10.01K",NA,"10.1M", "20K", "21k",NA)
si2num(x)
[1] NA 10 10010 NA 10100000 20000 21000 NA
Upvotes: 0
Reputation: 776
I am glad to meet you.
I wrote another answer
res = function (x) {
result = as.numeric(x)
if(is.na(result)){
text = gsub("k", "*1e3", x, ignore.case = T)
text = gsub("m", "*1e6", text, ignore.case = T)
result = eval(parse(text = text))
}
return(result)
}
> res("5M")
[1] 5e+06
> res("4K")
[1] 4000
> res("100")
[1] 100
> res("4k")
[1] 4000
> res("1e3")
[1] 1000
Upvotes: 0
Reputation: 33938
> 10 ** (3*as.integer(regexpr('T', 'KMGTPEY')))
[1] 1e+12
Then just multiply that power-of-ten by the decimal value you have.
-1*3
> unit_to_power <- function(u) {
exp_ <- 10**(as.integer(regexpr(u, 'KMGTPEY')) *3)
return (if(exp_>=0) exp_ else 1)
}
Now if you want to case-insensitive-match both 'k' and 'K' to Kilo (as computer people often write, even though it's technically an abuse of SI), then you'll need to special-case e.g with if-else ladder/expression (SI units are case-sensitive in general, 'M' means 'Mega' but 'm' strictly means 'milli' even if disk-drive users say otherwise; upper-case is conventionally for positive exponents). So for a few prefixes, @DanielV's case-specific code is better.
If you want negative SI prefixes too, use as.integer(regexpr(u, 'zafpnum@KMGTPEY')-8)
where @
is just some throwaway character to keep uniform spacing, it shouldn't actually get matched. Again if you need to handle non-power-of-10**3 units like 'deci', 'centi', will require special-casing, or the general dict-based approach WeNYoBen uses.
base::regexpr
is not vectorized also its performance is bad on big inputs, so if you want to vectorize and get higher-performance use stringr::str_locate
.
Upvotes: 6
Reputation: 323266
In your case you can using gsubfn
a=c('12M','1.2k')
dict<-list("k" = "e3", "M" = "e6")
as.numeric(gsubfn::gsubfn(paste(names(dict),collapse="|"),dict,a))
[1] 1.2e+07 1.2e+03
Upvotes: 1
Reputation: 1386
Give this a shot:
Text_Num <- function(x){
if (grepl("M", x, ignore.case = TRUE)) {
as.numeric(gsub("M", "", x, ignore.case = TRUE)) * 1e6
} else if (grepl("k", x, ignore.case = TRUE)) {
as.numeric(gsub("k", "", x, ignore.case = TRUE)) * 1e3
} else {
as.numeric(x)
}
}
Upvotes: 3