Lafayette

Reputation: 367

How to convert character to numeric within data.table for specific columns?

The dataset below has the same characteristics as my large dataset. I am managing it with data.table. Some columns are loaded as chr even though they contain numbers, and I want to convert them to numeric. The names of these columns are known.

library(data.table)
dt = data.table(A=LETTERS[1:10],B=letters[1:10],C=as.character(runif(10)),D = as.character(runif(10))) # simplified version
strTmp = c('C','D') # Name of columns to be converted to numeric

# columns converted to numeric and returned a  10 x 2 data.table
dt.out1 <- dt[,lapply(.SD, as.numeric, na.rm = T), .SDcols = strTmp]

I am able to convert those 2 columns to numeric with the code above, but I want to update dt instead. I tried using :=, but it didn't work. I need help here!

dt.out2 <- dt[, strTmp:=lapply(.SD, as.numeric, na.rm = T), .SDcols = strTmp] # returned a 10 x 6 data.table (2 columns extra)

I even tried the code below (written as for a data.frame; not my ideal solution even if it works, as I am worried that in some cases the order might change), but it still doesn't work. Can someone let me know why it doesn't work, please?

dt[,strTmp,with=F] <- dt[,lapply(.SD, as.numeric, na.rm = T), .SDcols = strTmp]

Thanks in advance!

Upvotes: 22

Views: 22151

Answers (2)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

While Roland's answer is more idiomatic, you can also consider set within a loop for something as direct as this. An approach might be something like:

strTmp = c('C','D')
ind <- match(strTmp, names(dt))

for (i in seq_along(ind)) {
  set(dt, NULL, ind[i], as.numeric(dt[[ind[i]]]))
}

str(dt)
# Classes ‘data.table’ and 'data.frame':  10 obs. of  4 variables:
#  $ A: chr  "A" "B" "C" "D" ...
#  $ B: chr  "a" "b" "c" "d" ...
#  $ C: num  0.308 0.564 0.255 0.828 0.128 ...
#  $ D: num  0.635 0.0485 0.6281 0.4793 0.7 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

According to the help page at ?set, this avoids some of the overhead of [.data.table, if that ever becomes a problem for you.
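
As a variation on the loop above, set() also accepts column names for j, so the match() step can be skipped. This is a minimal sketch, assuming the data.table package is installed; the toy data is illustrative only and not from the original post:

```r
library(data.table)

# Toy data (made-up values): C and D hold numbers stored as character
dt <- data.table(A = LETTERS[1:3],
                 C = c("0.1", "0.2", "0.3"),
                 D = c("1", "2", "3"))
strTmp <- c("C", "D")  # names of the columns to convert

# Loop over the column names directly; set() updates dt by reference,
# so there is no need to assign the result back to dt
for (col in strTmp) {
  set(dt, j = col, value = as.numeric(dt[[col]]))
}

sapply(dt, class)
```

Looking columns up by name rather than by position may also ease the worry about column order changing.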

Upvotes: 10

Roland

Reputation: 132989

  1. You don't need to assign the whole data.table if you assign by reference with := (i.e., you don't need dt.out2 <-).

  2. You need to wrap the LHS of := in parentheses to make sure it is evaluated as a variable holding the column names (and not used as the name of a single column).

Like this:

dt[, (strTmp) := lapply(.SD, as.numeric), .SDcols = strTmp]
str(dt)
#Classes ‘data.table’ and 'data.frame': 10 obs. of  4 variables:
# $ A: chr  "A" "B" "C" "D" ...
# $ B: chr  "a" "b" "c" "d" ...
# $ C: num  0.30204 0.00269 0.46774 0.08641 0.02011 ...
# $ D: num  0.151 0.0216 0.5689 0.3536 0.26 ...
# - attr(*, ".internal.selfref")=<externalptr> 
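
The same update can be sketched end to end with a quick check of the result. This assumes the data.table package is installed; the toy values, the deliberately unparseable entry "oops", and the helper names col_classes and n_na are illustrative and not part of the original answer. Note that na.rm is left out, as in the line above: as.numeric() does not use it, and strings that cannot be parsed simply become NA (with a warning, suppressed here).

```r
library(data.table)

# Toy data (made-up values): C and D hold numbers stored as character
dt <- data.table(A = LETTERS[1:3],
                 C = c("0.5", "1.25", "oops"),  # "oops" cannot be parsed
                 D = c("1", "2", "3"))
strTmp <- c("C", "D")

# Parentheses make := treat strTmp as a vector of column names.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
dt[, (strTmp) := lapply(.SD, function(x) suppressWarnings(as.numeric(x))),
   .SDcols = strTmp]

# dt was modified in place: same number of columns, C and D now numeric
col_classes <- vapply(dt, function(x) class(x)[1], character(1))
n_na <- dt[, sum(is.na(C))]  # counts entries that failed to convert
```

Counting NA values after the conversion is one way to spot entries that were not valid numbers in the first place.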

Upvotes: 43
