Naumz

Reputation: 481

What is the difference between passing a single column to j and selecting the same column from a data.table?

I have a data.table that contains name and class. Each name belongs to one class. Here is a sample dataset.

library(data.table)
DT <- data.table(name = c("John","Smith","Jane","Ruby","Emerald","Jasmine","Tulip"),
             class = rep(c(1,2,3), length.out = 7))  # repeat 1,2,3 explicitly across the 7 rows

I want to create a column which contains all the students in the class that a person belongs to. This is how I'm doing it and it works.

DT[, class.students := paste(.SD), .SDcols = "name", by = "class"]

I'm trying to understand why the following doesn't work, i.e., it does not evaluate the function over all the names in the group (the created column just contains the name value of the same row)

DT[, class.students := paste(name), by = "class"]

Especially when the code below with max works as expected, i.e., it evaluates over all elements in the group and returns the same value for each group.

DT[, class.students := max(name), by = "class"]

What am I missing here?

EDIT: max is a bad example as it doesn't work in the first way, using .SDcols, but I hope what I am trying to convey is clear.

Upvotes: 4

Views: 71

Answers (1)

akrun

Reputation: 887118

.SD is a list of columns, and paste converts each list element to a single string with as.character, so the output may not be the desired one (check it with str). As a small example

paste(list(letters[1:3])) #not the desirable output
#[1] "c(\"a\", \"b\", \"c\")"

paste(letters[1:3]) #did not change anything
#[1] "a" "b" "c"

However, paste also has sep and collapse arguments; collapse joins the elements into a single string

paste(letters[1:3], collapse=", ")
#[1] "a, b, c"

Using the OP's example,

DT[, class.students := paste(name, collapse=", "), by = class]
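
To see why max appeared to work while paste(name) did not, it may help to compare the length of each result per group. The sketch below assumes the question's data (with class recycled explicitly); the column names n_rows, len_paste, len_collapsed and len_max are made up for illustration. A length-1 result is recycled by := across the whole group, while a result as long as the group is assigned row by row.

```r
library(data.table)

# Same data as the question; rep() spells out the recycling of class.
DT <- data.table(
  name  = c("John", "Smith", "Jane", "Ruby", "Emerald", "Jasmine", "Tulip"),
  class = rep(c(1, 2, 3), length.out = 7)
)

# Length of each expression's result per group, next to the group size (.N).
# The len_* column names are illustrative only.
lens <- DT[, .(
  n_rows        = .N,
  len_paste     = length(paste(name)),                  # one string per row
  len_collapsed = length(paste(name, collapse = ", ")), # a single string
  len_max       = length(max(name))                     # a single string
), by = class]

print(lens)
```

paste(name) returns as many strings as there are rows in the group, so each row simply gets its own name back; the collapsed paste and max both return one value, which := repeats for every row of the group.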

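A function can also be applied to .SD itself. Because .SD is a list, with a single column we first convert it to a vector, either by extracting the column with [[ or with unlist. Here .SDcols = "name" keeps .SD to that one column even if class.students already exists from an earlier call.

DT[, class.students := paste(unlist(.SD), collapse=", "), by = class, .SDcols = "name"]

Or

DT[, class.students := paste(.SD[[1]], collapse=", "), by = class, .SDcols = "name"]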
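
As a quick check that the three forms agree, here is a minimal sketch that runs each one on a fresh copy of the question's data and compares the new column. The object names a, b and d are placeholders, and .SDcols = "name" is passed explicitly so that .SD holds only the name column.

```r
library(data.table)

DT <- data.table(
  name  = c("John", "Smith", "Jane", "Ruby", "Emerald", "Jasmine", "Tulip"),
  class = rep(c(1, 2, 3), length.out = 7)
)

# copy() gives each variant its own table, since := modifies by reference.
a <- copy(DT)[, class.students := paste(name, collapse = ", "), by = class]
b <- copy(DT)[, class.students := paste(unlist(.SD), collapse = ", "),
              by = class, .SDcols = "name"]
d <- copy(DT)[, class.students := paste(.SD[[1]], collapse = ", "),
              by = class, .SDcols = "name"]

# Compare the column produced by each variant.
same_ab <- identical(a$class.students, b$class.students)
same_ad <- identical(a$class.students, d$class.students)
print(c(same_ab, same_ad))
```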

Checking str(DT) after any of the calls above gives the same result.

As for the general way to apply functions to .SD: as already mentioned, it is a list of columns, and it is most useful when there are several columns to process. As with a data.frame, when there are multiple columns, we loop through them with lapply

DT[, class.students := lapply(.SD, paste, collapse=", "), by = class, .SDcols = "name"]

.SDcols restricts .SD to a subset of the columns. On the original two-column DT it could be omitted, because name is the only non-grouping column; once class.students exists, it is needed so that .SD does not pick up that column as well.

str(DT)
#Classes ‘data.table’ and 'data.frame':  7 obs. of  3 variables:
# $ name          : chr  "John" "Smith" "Jane" "Ruby" ...
# $ class         : num  1 2 3 1 2 3 1
# $ class.students: chr  "John, Ruby, Tulip" "Smith, Emerald" "Jane, Jasmine" "John, Ruby, Tulip" ...
# - attr(*, ".internal.selfref")=<externalptr> 
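
The lapply(.SD, ...) form pays off once more than one column needs the same treatment. The sketch below is only an illustration: the house column and the class.name / class.house output names are hypothetical and not part of the original data. .SDcols picks the columns to process, and wrapping the left-hand side in parentheses lets := create one new column per processed column.

```r
library(data.table)

DT <- data.table(
  name  = c("John", "Smith", "Jane", "Ruby", "Emerald", "Jasmine", "Tulip"),
  # hypothetical extra column, added only to have two columns to collapse
  house = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red"),
  class = rep(c(1, 2, 3), length.out = 7)
)

cols     <- c("name", "house")      # columns handed to .SD
new_cols <- paste0("class.", cols)  # "class.name", "class.house"

# lapply returns one collapsed string per column in .SD;
# each is recycled over the rows of its group.
DT[, (new_cols) := lapply(.SD, paste, collapse = ", "),
   by = class, .SDcols = cols]

print(DT)
```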

Upvotes: 3
