SimonSimon
SimonSimon

Reputation: 67

Assignment by group in R data.table

Consider the following data:

data <- data.table(ID = 1:10, x = c(1,1,0,0,0,1,1,0,0,1), y = c(0,0,1,1,1,1,1,0,1,1))

I want to assign the following value in a column new: For each pair of x and y (e.g. x = 1 and y = 1), find the row highest up in the list where this specific pair occurs and let new be the ID of this row. For example, for all rows where x = 1 and y = 1, I want to new to be 6.

The following line of code seems to do precisely that:

data[, new := head(.SD, 1), .(x, y)]

My question is just, why does this work? head(.SD, 1) will be a list containing one row, how can R know I want to assign specifically the value in the first column of head(.SD, 1) to new? I was expecting an error when trying to run this code, but I do actually get the desired output.

Upvotes: 4

Views: 906

Answers (1)

Drumy
Drumy

Reputation: 460

This is a very interesting observation.

First, I want to clarify that when you do by = .(x, y), your .SD consists of only one column (ID). You can test that by comparing these two lines of code below.

This first line asks for the first column of .SD, and it works:

data[, head(.SD[, 1]), .(x, y)]
    x y ID
 1: 1 0  1
 2: 1 0  2
 3: 0 1  3
 4: 0 1  4
 5: 0 1  5
 6: 0 1  9
 7: 1 1  6
 8: 1 1  7
 9: 1 1 10
10: 0 0  8

But the second line below asks for the second column of .SD and it gets an error:

data[, head(.SD[, 2]), .(x, y)]
Error in `[.data.table`(.SD, , 2) : 
  Item 1 of j is 2 which is outside the column number range [1,ncol=1]

You see, .SD has only one column (ID). It does not contain the columns in by. That's why your code works as expected.

However, your observation is still valid. Consider the expanded data.table with two columns beside x and y.

data <- data.table(ID1 = 1:10, 
                   ID2 = letters[1:10],
                   x = c(1,1,0,0,0,1,1,0,0,1),
                   y = c(0,0,1,1,1,1,1,0,1,1))
data[, new := head(.SD, 1), .(x, y)]
data
    ID1 ID2 x y new
 1:   1   a 1 0   1
 2:   2   b 1 0   1
 3:   3   c 0 1   3
 4:   4   d 0 1   3
 5:   5   e 0 1   3
 6:   6   f 1 1   6
 7:   7   g 1 1   6
 8:   8   h 0 0   8
 9:   9   i 0 1   3
10:  10   j 1 1   6

new takes ID1 only. Why? According to the help page of :=, when there is a type mismatch between the LHS and the RHS of :=, "the RHS is coerced to match type of the LHS". In this case the RHS is a list of 3, and the LHS is a list of 1, so only the first element is taken.

Upvotes: 3

Related Questions