Reputation: 481
I have a data.table that contains name and class. Each name belongs to one class. Here is a sample dataset.
library(data.table)
DT <- data.table(name = c("John","Smith","Jane","Ruby","Emerald","Jasmine","Tulip"),
                 # recycle the three class labels explicitly so they cover all 7 rows
                 class = rep(c(1,2,3), length.out = 7))
I want to create a column which contains all the students in the class that a person belongs to. This is how I'm doing it and it works.
DT[, class.students := paste(.SD), .SDcols = "name", by = "class"]
I'm trying to understand why the following doesn't work, i.e., it does not evaluate the function over all the name values in the group (it returns just the name value of the row in the created column):
DT[, class.students := paste(name), by = "class"]
Especially when the code below with max works as expected, i.e., it evaluates over all elements in the group and returns the same value for each group.
DT[, class.students := max(name), by = "class"]
What am I missing here?
EDIT: max is a bad example as it doesn't work in the first way, using .SDcols, but I hope what I am trying to convey is clear.
Upvotes: 4
Views: 71
Reputation: 887118
.SD is a list, so paste(.SD) returns output that may not be the desired one (check it with str). As a small example:
paste(list(letters[1:3])) #not the desirable output
#[1] "c(\"a\", \"b\", \"c\")"
paste(letters[1:3]) #did not change anything
#[1] "a" "b" "c"
However, paste also has sep and collapse as arguments:
paste(letters[1:3], collapse=", ")
#[1] "a, b, c"
Using the OP's example,
DT[, class.students := paste(name, collapse=", "), by = class]
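The difference between paste(name) and max(name) in the question comes down to the length of the result. A minimal base R sketch, using a hypothetical vector x that stands in for the names of one group:

```r
# Hypothetical group: the names that fall in class 1 of the sample data
x <- c("John", "Ruby", "Tulip")

length(paste(x))                   # 3: one string per element, nothing is combined
length(paste(x, collapse = ", "))  # 1: the elements are joined into a single string
length(max(x))                     # 1: a summary function reduces the vector
```

Within each group, a length-1 result is repeated for every row, while a result as long as the group is assigned row by row, which is why paste(name) appears to return just the name of each row.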
We would recommend applying a function directly to .SD, but if there is a single column, we can convert the list to a vector, either with [[ or with unlist, etc.
DT[, class.students := paste(unlist(.SD), collapse=", "), by = class]
Or
DT[, class.students := paste(.SD[[1]], collapse=", "), by = class]
If we check str(DT) after any of the above, the result is the same.
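The two conversions can also be compared without data.table. This is only a sketch: sd_like is a hypothetical plain list that stands in for the one-column .SD of a single group.

```r
# Hypothetical stand-in for .SD within one group: a list holding one column
sd_like <- list(name = c("John", "Ruby", "Tulip"))

by_index  <- paste(sd_like[[1]], collapse = ", ")     # extract the column as a vector
by_unlist <- paste(unlist(sd_like), collapse = ", ")  # flatten the list to a vector

identical(by_index, by_unlist)  # both approaches build the same single string
length(paste(sd_like))          # 1: the whole list element is deparsed into one string
```

With more than one column, unlist would mix the values of all columns into one vector, so [[ is the safer choice when a specific column is wanted.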
Regarding the optimal way to apply functions on .SD: as already mentioned, it is a list, and .SD is useful when there are many columns. As with a data.frame, when there are multiple columns, we loop through the columns with lapply and proceed:
DT[, class.students := lapply(.SD, paste, collapse=", "), by = class]
We can also specify .SDcols if only a subset of the columns is used. Here, in the example, there are only two columns, and the grouping column is not part of .SD, so .SDcols is not needed.
str(DT)
#Classes ‘data.table’ and 'data.frame': 7 obs. of 3 variables:
# $ name : chr "John" "Smith" "Jane" "Ruby" ...
# $ class : num 1 2 3 1 2 3 1
# $ class.students: chr "John, Ruby, Tulip" "Smith, Emerald" "Jane, Jasmine" "John, Ruby, Tulip" ...
# - attr(*, ".internal.selfref")=<externalptr>
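As a rough end-to-end check, the vector form and the lapply(.SD, ...) form can be run on fresh copies of the sample data and compared. This sketch makes two assumptions that are not in the original code: the class column is built with an explicit rep(), since recent data.table versions may refuse to recycle a length-3 vector to 7 rows, and .SDcols = "name" is passed so that .SD holds only the name column.

```r
library(data.table)

# Sample data from the question, with class recycled explicitly (assumption)
DT <- data.table(name  = c("John","Smith","Jane","Ruby","Emerald","Jasmine","Tulip"),
                 class = rep(c(1, 2, 3), length.out = 7))

# Work on copies so that neither call sees the column added by the other
res_vec <- copy(DT)[, class.students := paste(name, collapse = ", "), by = class]
res_sd  <- copy(DT)[, class.students := lapply(.SD, paste, collapse = ", "),
                    by = class, .SDcols = "name"]

identical(res_vec$class.students, res_sd$class.students)
```

Using copy() matters here because := modifies a data.table by reference; without it, the second call would run on a table that already has a class.students column.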
Upvotes: 3