How to use minus/set complement operator correctly in R?

Question

I wanted to create a new data frame, df2, by subsetting an existing data frame, call it df, by the rows for which the values of one of its columns, call it column, are non-zero.

What I tried at first was:

df2 <- df[-(df$column == 0), ]

However, this did not work. What did work was:

df2 <- df[(df$column != 0), ]

I get why the second one worked, but I don't understand why the first one didn't work, except for operator overloading.

Specifically, running -(df$column == 0) and (df$column != 0) return different results -- the former isn't even a logical vector, but -1 times the logical vector (df$column == 0). So everywhere I wanted the value 1, it had the value 0, and everywhere I wanted it to have the value 0, it had the value -1. Now I know it would not have been that difficult to fix, say by writing 1 - (df$column == 0), but the point is that I did not expect the operator - to behave that way; I expected it to behave as the set complement operator. (I.e. I did not even anticipate that there would be a problem, so I was not thinking about how to fix it.)

Concrete Question: How does the R language decide whether and when to interpret the - operator as: (1) set complement operator (2) subtraction operator (3) multiplication by -1 ?

All of the documentation I found only addresses the use of the operator - as (2) subtraction operator, but doesn't mention how R disambiguates between (1) and (3).

Also, I know that (2) and (3) are more or less mathematically equivalent, but that might not mean that their implementations are the same. (E.g. matrix inversion in MATLAB.)

Yannis Vassiliadis · Accepted Answer

I think the complication arises because you're using the values 0 and 1, which are also the numerical equivalents of FALSE and TRUE. So I will try to explain what went wrong with your code above using a case where df has only two rows and different numbers:

> df <- data.frame(column = c(2, 3))
> df
  column
1      2
2      3

Calling (df$column==3) returns two logical values:

> df$column == 3
[1] FALSE  TRUE

Because TRUE becomes 1 and FALSE becomes 0 when coerced to numbers, calling df[-(df$column == 3), ] is the same as calling df[-c(0, 1), ], i.e. df[c(0, -1), ]. In both cases the 0 is ignored and the -1 removes the 1st row (there is no 0-indexing in R).

> df[-(df$column == 3), ]
[1] 3
> df[-c(0, 1), ]
[1] 3
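
The coercion described above can be checked directly. This is only a minimal sketch on the same two-row toy data; the names cond and neg are made up for illustration, and drop = FALSE is added so the result stays a data frame instead of collapsing to a vector.

```r
# Toy data frame mirroring the two-row example above.
df <- data.frame(column = c(2, 3))

cond <- df$column == 3   # logical vector: FALSE TRUE
neg  <- -cond            # unary minus coerces logical to integer: 0 -1

class(cond)              # "logical"
class(neg)               # "integer"

# `[` ignores the 0 and reads -1 as "drop row 1", so only row 2 is left.
# drop = FALSE keeps a one-column result as a data frame.
df[neg, , drop = FALSE]
```

So the minus sign never sees a "condition"; it only sees the numbers 0 and 1, and the row that gets dropped is row 1 regardless of which row actually matched.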

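The reverse happens when you call df[(df$column != 3), ]: the index is the logical vector c(TRUE, FALSE), so this time you are retaining the 1st row. With these particular values, the numeric index c(1, 0) happens to give the same result, because the 0 is ignored and the 1 selects row 1.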

> df[(df$column != 3),]
[1] 2
> df[c(1,0), ]
[1] 2

What you were trying to do is remove the row for which df$column == 3, but in order to do that you need to know which row number it is, so you need its index. That's where the which function comes in. So you would do:

df2 <- df[-which(df$column == 3), ]
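
One caveat with the -which() form is worth a quick sketch: if no row matches, which() returns an empty integer vector, and negating it still gives an empty vector, so the subset comes back with zero rows instead of all of them. The value 5 and the name idx below are arbitrary choices for illustration.

```r
df <- data.frame(column = c(2, 3))

# No row has column == 5, so which() returns integer(0).
idx <- which(df$column == 5)

# -integer(0) is still integer(0): this selects zero rows, not all rows.
nrow(df[-idx, , drop = FALSE])

# Logical negation keeps both rows, which is what was intended.
nrow(df[!(df$column == 5), , drop = FALSE])
```

For that reason the logical forms, df[df$column != 3, ] or df[!(df$column == 3), ], are usually the safer way to write this kind of filter.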

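Other than that, as far as I know R never treats - as a set complement operator. It is always arithmetic: subtraction with two operands, negation with one, and a logical vector is coerced to 0/1 before it is applied. The "exclude these" behaviour comes from the [ operator, which treats negative numbers as positions to drop. To complement a logical vector, use ! instead.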
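
A small sketch contrasting - and ! on the same logical vector may help here. The vector x and the name is_zero are hypothetical, chosen to resemble the zero/non-zero filter from the question; setdiff() is shown only as the base R function for a set difference on values.

```r
x <- c(0, 4, 0, 7)

is_zero <- x == 0    # TRUE FALSE TRUE FALSE

!is_zero             # logical complement: FALSE TRUE FALSE TRUE
-is_zero             # arithmetic negation: -1 0 -1 0

x[!is_zero]          # keeps the non-zero elements: 4 7
x[-is_zero]          # only drops position 1 (the -1 repeats): 4 0 7

# A set difference on values is written with setdiff(), not with `-`.
setdiff(c(1, 2, 3), c(2))
```

Note that x[-is_zero] does not raise an error; it quietly returns the wrong subset, which is why the mistake is easy to miss.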

Disclaimer: I am sorry for the long and maybe pedantic answer; I just did not want to assume anything.
