gaut

Reputation: 5958

Automatic num-to-char conversion in R using apply and data.table

I'd like to calculate the mean difference of two columns of my data.frame, grouping by a third.

Why does apply() convert numeric vectors to character? Why does data.table convert the results to char?

library(dplyr); library(data.table)
a <- letters[c(1,1:9)]
b <- (1:10)/10
c <- sin(1:10)
dat <- data.frame(a,b,c)
table(dat$a)
typeof(dat$b) #double
dat$bb <- apply(dat, 1,function(x) x["b"])
typeof(dat$bb) #character
dat$bb <- apply(dat, 1,function(x) x["b"]-x["c"])
# Error in x["b"] - x["c"] : non-numeric argument to binary operator
tidydat <- dat %>% group_by(a) %>% summarise(diffr = mean(b-c))
typeof(tidydat$diffr) #double
dt <- data.table(dat)
dt[,bb:=mean(b-c), by=a]
typeof(dt$bb) #character

> dt$bb
 [1] "-0.725384205816789" "-0.725384205816789" "0.158879991940133"  "1.15680249530793"   "1.45892427466314"  
 [6] "0.879415498198926"  "0.0430134012812109" "-0.189358246623382" "0.487881514758243"  "1.54402111088937"  
> tidydat$diffr
[1] -0.7253842  0.1588800  1.1568025  1.4589243  0.8794155  0.0430134 -0.1893582  0.4878815  1.5440211

EDIT: the data.table part is not true. As @akrun pointed out, I was just modifying by reference a character column that already existed.

Upvotes: 3

Views: 357

Answers (2)

ThomasIsCoding

Reputation: 102251

I think @akrun has already explained the reason well. You can run the code below to see what happens when you use apply by rows:

> apply(dat, 1, str)
 Named chr [1:3] "a" "0.1" " 0.8414710"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "a" "0.2" " 0.9092974"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "b" "0.3" " 0.1411200"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "c" "0.4" "-0.7568025"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "d" "0.5" "-0.9589243"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "e" "0.6" "-0.2794155"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "f" "0.7" " 0.6569866"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "g" "0.8" " 0.9893582"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "h" "0.9" " 0.4121185"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
 Named chr [1:3] "i" "1.0" "-0.5440211"
 - attr(*, "names")= chr [1:3] "a" "b" "c"
NULL

As you can see, when you run apply(dat, 1, FUN = ...), each row passed to FUN has been coerced to a character vector; it is no longer a data.frame.
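To make the coercion visible without apply, here is a minimal sketch in base R. It rebuilds the question's data and calls as.matrix directly, which is the conversion apply performs on a data.frame. The names m, m_num and row_diff are only illustrative.

```r
# Rebuild the example data from the question
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# With a character column present, every cell of the matrix becomes a string
m <- as.matrix(dat)
typeof(m)

# Keeping only the numeric columns gives a numeric matrix
m_num <- as.matrix(dat[c("b", "c")])
typeof(m_num)

# The row-wise difference now operates on numbers
row_diff <- apply(dat[c("b", "c")], 1, function(x) x["b"] - x["c"])
```

The first typeof call should report "character" and the second "double", which is why the subtraction only works once column a is left out.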

Upvotes: 0

akrun

Reputation: 887501

apply converts the dataset from a data.frame to a matrix

> is.matrix(apply(dat, 1, I))
[1] TRUE

and a matrix can hold only a single type, i.e. if there is a character element, the whole data is converted to character. Instead, use lapply (if the operation is column-wise), or subset the numeric columns before calling apply

out <- apply(dat[-1], 1,function(x) x["b"]-x["c"]) 

-output

> out
 [1] -0.7414710 -0.7092974  0.1588800  1.1568025  1.4589243  0.8794155  0.0430134 -0.1893582  0.4878815  1.5440211
> str(out)
 num [1:10] -0.741 -0.709 0.159 1.157 1.459 ...

The reason for the change in behavior is that a vector can have only a single type, and in a data.frame/data.table/tibble etc. the list elements are the columns, not the rows, i.e. the type is specific to a column and not to a row.
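Since each column keeps its own type, the row-wise apply can usually be skipped altogether. The following is a rough base R sketch of the column-wise route, assuming the same example data as the question. The names diff, group_means and col_types are made up for illustration.

```r
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# Column arithmetic is vectorised and stays numeric
dat$diff <- dat$b - dat$c
typeof(dat$diff)

# Mean difference per group; tapply() returns one value per level of a
group_means <- tapply(dat$diff, dat$a, mean)

# Iterating over the columns (the list elements) shows one type per column
col_types <- sapply(dat, typeof)
```

This gives the grouped mean the question asks for, with no matrix conversion involved.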


Regarding the data.table case

> library(data.table)
> dt <- as.data.table(dat)
> dt$bb <- NULL # in case if the character column was already created
> dt[,bb:=mean(b-c), by=a]
> str(dt)
Classes ‘data.table’ and 'data.frame':  10 obs. of  4 variables:
 $ a : chr  "A" "A" "B" "C" ...
 $ b : num  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
 $ c : num  0.841 0.909 0.141 -0.757 -0.959 ...
 $ bb: num  -0.725 -0.725 0.159 1.157 0.704 ...
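A similar effect can be reproduced in base R, where filling an existing character vector with numbers keeps it character. This is only an analogy to the by-reference update mentioned in the question's edit, not a description of data.table's internals. The variable names type_after_fill and type_after_reset are hypothetical.

```r
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# Simulate the leftover character column from the earlier apply() call
dat$bb <- as.character(dat$b)

# Filling the existing column in place keeps its character type;
# ave() uses the group mean by default
dat$bb[] <- ave(dat$b - dat$c, dat$a)
type_after_fill <- typeof(dat$bb)

# Removing the column first lets the new values define the type
dat$bb <- NULL
dat$bb <- ave(dat$b - dat$c, dat$a)
type_after_reset <- typeof(dat$bb)
```

Here type_after_fill should be "character" and type_after_reset "double", which mirrors why dropping bb before the := call gives a numeric column.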

Upvotes: 3
