OverFlow Police

Reputation: 861

Choosing every other column of a data.table in R: what is the difference between the two syntaxes?

Let's create a data.table, data_1. I would like to select every other column, say the odd columns. What is the difference between the following two syntaxes? Why does the second one not work?

library(data.table)

data_1 = data.table(col_1 = c(11, 21, 31),
                    col_2 = c(12, 22, 32),
                    col_3 = c(13, 23, 33),
                    col_4 = c(14, 24, 34))

col_dim <- ncol(data_1)
col_dim/2 # this equals 2


odd_cols <- data_1[, c(rep(c(TRUE, FALSE), 2))] # works
odd_cols

   col_1 col_3
1:    11    13
2:    21    23
3:    31    33

odd_cols <- data_1[, c(rep(c(TRUE, FALSE), (col_dim/2)))] # does not work!
odd_cols
[1]  TRUE FALSE  TRUE FALSE

Upvotes: 2

Views: 106

Answers (1)

akrun

Reputation: 887223

It is better to use with = FALSE when selecting columns this way in data.table. It gives the same output in both cases. According to ?data.table:

with - By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables. In case of overlapping variables names inside dataset and in parent scope you can use double dot prefix ..cols to explicitly refer to 'cols variable parent scope and not from your dataset.

out1 <- data_1[, c(rep(c(TRUE, FALSE), (col_dim/2))), with = FALSE]
out2 <- data_1[, c(rep(c(TRUE, FALSE), 2)), with = FALSE]
identical(out1, out2)
#[1] TRUE
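
The double-dot prefix mentioned in the quoted documentation is another way to make the intent explicit: build the selection outside j and refer to it as ..name. A minimal sketch, assuming data_1 and col_dim as defined in the question; odd_idx is an illustrative name that does not appear in the original post.

```r
library(data.table)

data_1 <- data.table(col_1 = c(11, 21, 31),
                     col_2 = c(12, 22, 32),
                     col_3 = c(13, 23, 33),
                     col_4 = c(14, 24, 34))
col_dim <- ncol(data_1)

# Illustrative name: positions of the odd columns, computed outside of j
odd_idx <- seq(1, col_dim, by = 2)

# The .. prefix asks data.table to look up odd_idx in the calling scope
out_dots <- data_1[, ..odd_idx]

# Same selection with with = FALSE, as in the answer above
out_with <- data_1[, odd_idx, with = FALSE]
```

Using seq() for the positions also covers tables with an odd number of columns, where col_dim/2 is not a whole number.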

If we check the second call with verbose = TRUE:

data_1[, c(rep(c(TRUE, FALSE), (col_dim/2))), verbose = TRUE]
#Detected that j uses these columns: <none> 
#[1]  TRUE FALSE  TRUE FALSE

whereas in the first case, the expression was treated directly as the column index in j:

data_1[, c(rep(c(TRUE, FALSE), 2)), verbose = TRUE]
#   col_1 col_3
#1:    11    13
#2:    21    23
#3:    31    33

In the first case, we are providing a literal numeric/integer value, while in the second case, data.table also has to find the object (col_dim) in the global environment. To understand the behavior, I ran some experiments:

1) Supplying the value of 'col_dim' as a literal and then dividing by 2

data_1[, c(rep(c(TRUE, FALSE), 4/2))]
#   col_1 col_3
#1:    11    13
#2:    21    23
#3:    31    33

2) Ruling out the type, by passing an integer through a variable

n1 <- 2L
data_1[, c(rep(c(TRUE, FALSE), n1))]
#[1]  TRUE FALSE  TRUE FALSE

So the change in behavior appears to come from j referring to an object (col_dim or n1) that has to be looked up, rather than from the type of the value.
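
One way to see the difference between the calls is to list the variable names each j expression refers to. The sketch below uses base R's all.vars() on the unevaluated expressions. That data.table treats j as a column selection only when it references no variables is an assumption here, consistent with the verbose output and the experiments above but not confirmed in this answer.

```r
# Unevaluated j expressions from the question and the experiments
j_literal  <- quote(c(rep(c(TRUE, FALSE), 2)))
j_col_dim  <- quote(c(rep(c(TRUE, FALSE), (col_dim / 2))))
j_n1       <- quote(c(rep(c(TRUE, FALSE), n1)))

# all.vars() returns the names of the variables used in an expression;
# TRUE, FALSE and numbers are constants, not variables
vars_literal <- all.vars(j_literal)   # no variables
vars_col_dim <- all.vars(j_col_dim)   # refers to col_dim
vars_n1      <- all.vars(j_n1)        # refers to n1

print(vars_literal)
print(vars_col_dim)
print(vars_n1)
```

Under that assumption, the two expressions that contain a variable are evaluated as ordinary R code and return the logical vector itself, which matches the output seen above.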

Upvotes: 3
