Michael O'Keefe
Michael O'Keefe

Reputation: 53

Converting unit abbreviations to numbers

I have a dataset that abbreviates numerical values in a column. For example, 12M mean 12 million, 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?

I've though about using gsub to convert M to 000,000 etc but that does not take into account the decimals (1.5M would then be 1.5000000).

Upvotes: 5

Views: 1508

Answers (5)

Nikolay
Nikolay

Reputation: 593

All other answers didn't work nicely with NA for me (or produce warnings, which is also not nice).

Here is my solution, which reuses some bits from other solutions. (also posted at R data.table speed up SI / Metric Conversion)

library(stringr)

si2num <- function(x)
{
  conv <- paste0("e", c(seq(-24 ,-3, by=3), -2, -1, seq(3, 24, by=3),3))
  names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","K","M","G","T","P","E","Z","Y","k")
  
  xout <- str_replace_all(x, conv)
  xout <- as.numeric(xout)
  
  return(xout)
}

x <- c(NA,"10", "10.01K",NA,"10.1M", "20K", "21k",NA)

si2num(x)
[1]       NA       10    10010       NA 10100000    20000    21000       NA

Upvotes: 0

Steve Lee
Steve Lee

Reputation: 776

I am glad to meet you.

I wrote another answer

Define function

res = function (x) {
  result = as.numeric(x)
  if(is.na(result)){
  text = gsub("k", "*1e3", x, ignore.case = T)
  text = gsub("m", "*1e6", text, ignore.case = T)
  result = eval(parse(text = text))
  } 
  return(result)
}

Result

> res("5M")
[1] 5e+06
> res("4K")
[1] 4000
> res("100")
[1] 100
> res("4k")
[1] 4000
> res("1e3")
[1] 1000

Upvotes: 0

smci
smci

Reputation: 33938

  • So you want to translate SI unit abbreviations ('K','M',...) into exponents, and thus numerical powers-of-ten. Given that all units are single-letter, and the exponents are uniformly-spaced powers of 10**3, here's working code that handles 'Kilo'...'Yotta', and any future exponents:
    > 10 ** (3*as.integer(regexpr('T', 'KMGTPEY')))
    [1] 1e+12

Then just multiply that power-of-ten by the decimal value you have.

  • Also, you probably want to detect and handle the 'no-match' case for unknown letter prefixes, otherwise you'd get a nonsensical -1*3
    > unit_to_power <- function(u) {
        exp_ <- 10**(as.integer(regexpr(u, 'KMGTPEY')) *3)
        return (if(exp_>=0) exp_ else 1)
    }
  • Now if you want to case-insensitive-match both 'k' and 'K' to Kilo (as computer people often write, even though it's technically an abuse of SI), then you'll need to special-case e.g with if-else ladder/expression (SI units are case-sensitive in general, 'M' means 'Mega' but 'm' strictly means 'milli' even if disk-drive users say otherwise; upper-case is conventionally for positive exponents). So for a few prefixes, @DanielV's case-specific code is better.

  • If you want negative SI prefixes too, use as.integer(regexpr(u, 'zafpnum@KMGTPEY')-8) where @ is just some throwaway character to keep uniform spacing, it shouldn't actually get matched. Again if you need to handle non-power-of-10**3 units like 'deci', 'centi', will require special-casing, or the general dict-based approach WeNYoBen uses.

  • base::regexpr is not vectorized also its performance is bad on big inputs, so if you want to vectorize and get higher-performance use stringr::str_locate.

Upvotes: 6

BENY
BENY

Reputation: 323266

In your case you can using gsubfn

a=c('12M','1.2k')
dict<-list("k" = "e3", "M" = "e6")
as.numeric(gsubfn::gsubfn(paste(names(dict),collapse="|"),dict,a))
[1] 1.2e+07 1.2e+03

Upvotes: 1

Daniel V
Daniel V

Reputation: 1386

Give this a shot:

Text_Num <- function(x){
    if (grepl("M", x, ignore.case = TRUE)) {
        as.numeric(gsub("M", "", x, ignore.case = TRUE)) * 1e6
    } else if (grepl("k", x, ignore.case = TRUE)) {
        as.numeric(gsub("k", "", x, ignore.case = TRUE)) * 1e3
    } else {
        as.numeric(x)
    }
}

Upvotes: 3

Related Questions