Reputation: 5958
I'd like to calculate the mean difference of two columns of my data.frame, grouping by a third.
apply doesn't even let me compute any arithmetic operation without explicit conversion of already-numeric columns. data.table performs the operation and the grouping but returns a character vector. dplyr syntax returns numeric values correctly.
Why does apply() convert numeric vectors to character? Why does data.table convert the results to character?
library(dplyr); library(data.table)
a <- letters[c(1,1:9)]
b <- (1:10)/10
c <- sin(1:10)
dat <- data.frame(a,b,c)
table(dat$a)
typeof(dat$b) #double
dat$bb <- apply(dat, 1,function(x) x["b"])
typeof(dat$bb) #character
dat$bb <- apply(dat, 1,function(x) x["b"]-x["c"])
# Error in x["b"] - x["c"] : non-numeric argument to binary operator
tidydat <- dat %>% group_by(a) %>% summarise(diffr = mean(b-c))
typeof(tidydat$diffr) #double
dt <- data.table(dat)
dt[,bb:=mean(b-c), by=a]
typeof(dt$bb) #character
> dt$bb
[1] "-0.725384205816789" "-0.725384205816789" "0.158879991940133" "1.15680249530793" "1.45892427466314"
[6] "0.879415498198926" "0.0430134012812109" "-0.189358246623382" "0.487881514758243" "1.54402111088937"
> tidydat$diffr
[1] -0.7253842 0.1588800 1.1568025 1.4589243 0.8794155 0.0430134 -0.1893582 0.4878815 1.5440211
EDIT: the data.table part is untrue. I was just modifying by reference an already existing character column (pointed out by @akrun).
Upvotes: 3
Views: 357
Reputation: 102251
I think @akrun has provided enough information to understand the reason behind this. You can run the code below to see what happens when you use apply by rows:
> apply(dat, 1, str)
Named chr [1:3] "a" "0.1" " 0.8414710"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "a" "0.2" " 0.9092974"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "b" "0.3" " 0.1411200"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "c" "0.4" "-0.7568025"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "d" "0.5" "-0.9589243"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "e" "0.6" "-0.2794155"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "f" "0.7" " 0.6569866"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "g" "0.8" " 0.9893582"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "h" "0.9" " 0.4121185"
- attr(*, "names")= chr [1:3] "a" "b" "c"
Named chr [1:3] "i" "1.0" "-0.5440211"
- attr(*, "names")= chr [1:3] "a" "b" "c"
NULL
As you can see, when you run apply(dat, 1, FUN = ...), the data passed to FUN is coerced to a character vector; it is no longer a data.frame.
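The coercion can also be checked directly, without going through str. This is a minimal sketch that rebuilds the question's dat so it runs on its own; the names m and diff_rowwise are only illustrative and do not appear in the question.

```r
# Rebuild the question's data so the block is self-contained
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# apply() coerces a data.frame with as.matrix(); because column a is
# character, every cell of the resulting matrix becomes character
m <- as.matrix(dat)
typeof(m)             # "character"

# Arithmetic on the columns themselves never goes through a matrix,
# so the numeric type is preserved
diff_rowwise <- dat$b - dat$c
typeof(diff_rowwise)  # "double"
```

For a plain element-wise difference like this, vectorised column arithmetic avoids apply altogether.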
Upvotes: 0
Reputation: 887501
apply converts the dataset from a data.frame to a matrix:
> is.matrix(apply(dat, 1, I))
[1] TRUE
A matrix can hold only a single type, i.e. if there is a character element, the whole data is converted to character. Instead, use lapply (if the operation is column-wise), or subset the numeric columns before calling apply:
out <- apply(dat[-1], 1,function(x) x["b"]-x["c"])
Output:
> out
[1] -0.7414710 -0.7092974 0.1588800 1.1568025 1.4589243 0.8794155 0.0430134 -0.1893582 0.4878815 1.5440211
> str(out)
num [1:10] -0.741 -0.709 0.159 1.157 1.459 ...
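Dropping the first column by position works for this dataset. As a hedged alternative, the numeric columns can be selected by type, which does not depend on where the character column sits. The names num_cols and out_by_type are illustrative only, and dat is rebuilt here so the block is self-contained.

```r
# Rebuild the question's data so the block is self-contained
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# Logical vector marking which columns are numeric
num_cols <- sapply(dat, is.numeric)

# Only numeric columns reach apply(), so as.matrix() yields a numeric matrix
out_by_type <- apply(dat[num_cols], 1, function(x) x["b"] - x["c"])
str(out_by_type)
```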
The reason for the change in behavior is that a vector can have only a single class, and in a data.frame/data.table/tibble etc. the columns, not the rows, are the list elements, i.e. class is specific to a column and not to a row.
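The column-wise nature of the class can be illustrated with a short sketch, again rebuilding dat so the block stands alone. The names col_classes, row1 and grp_mean are illustrative, and tapply is used here only as one base R way to get the grouped mean asked about in the question, not as the method used in this answer.

```r
# Rebuild the question's data so the block is self-contained
dat <- data.frame(a = letters[c(1, 1:9)], b = (1:10) / 10, c = sin(1:10))

# Each column is a list element and carries its own class
col_classes <- sapply(dat, class)
col_classes

# A row extracted with [ is still a one-row data.frame (a list),
# so b and c stay numeric and can be subtracted
row1 <- dat[1, ]
row1$b - row1$c

# Grouped mean of the difference in base R, with no row-wise apply()
grp_mean <- tapply(dat$b - dat$c, dat$a, mean)
grp_mean
```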
Regarding the data.table case:
> library(data.table)
> dt <- as.data.table(dat)
> dt$bb <- NULL # in case if the character column was already created
> dt[,bb:=mean(b-c), by=a]
> str(dt)
Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
$ a : chr "a" "a" "b" "c" ...
$ b : num 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
$ c : num 0.841 0.909 0.141 -0.757 -0.959 ...
$ bb: num -0.725 -0.725 0.159 1.157 1.459 ...
Upvotes: 3