Reputation: 1231
I wanted to create a new data frame, df2, by subsetting an existing data frame, call it df, by the rows for which the values of one of its columns, call it column, are non-zero.
What I tried at first was:
df2 <- df[-(df$column == 0), ]
However, this did not work. What did work was:
df2 <- df[(df$column != 0), ]
I get why the second one worked, but I don't understand why the first one didn't work, except for operator overloading.
Specifically, running -(df$column == 0) and (df$column != 0) returns different results: the former isn't even a logical vector, but -1 times the logical vector (df$column == 0). So everywhere I wanted the value 1, it had the value 0, and everywhere I wanted it to have the value 0, it had the value -1. Now I know it would not have been that difficult to fix, say by writing 1 - (df$column == 0), but the point is that I did not expect the operator - to behave that way; I expected it to behave as the set complement operator. (I.e., I did not even anticipate that there would be a problem, so I was not thinking about how to fix it.)
Concrete Question: How does the R language decide whether and when to interpret the - operator as: (1) set complement operator (2) subtraction operator (3) multiplication by -1?
All of the documentation I found only addresses the use of the operator - as (2) subtraction operator, but doesn't mention how R disambiguates between (1) and (3).
Also, I know that (2) and (3) are more or less mathematically equivalent, but that might not mean that their implementations are the same. (E.g. matrix inversion in MATLAB.)
Upvotes: 1
Views: 2086
Reputation: 7839
The - operator is implemented as a function that takes one or two arguments.
> `-`
function (e1, e2) .Primitive("-")
So the expression -a is interpreted to mean `-`(a) and a - b is interpreted as `-`(a, b).
With one argument, - returns the additive inverse (i.e. it reverses the sign of the argument), and with two arguments it does subtraction.
> `-`(3)
[1] -3
> `-`(3, 1)
[1] 2
It doesn't do set operations.
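To connect this to the question, here is a small sketch (the vector name x is made up for illustration) showing that unary - applied to a logical vector coerces it to integers rather than complementing it, while ! gives the element-wise logical complement:

```r
# Hypothetical logical vector, for illustration only
x <- c(TRUE, FALSE, TRUE)

neg <- -x          # unary minus coerces TRUE/FALSE to 1/0, then flips the sign
class(neg)         # "integer"
neg                # -1  0 -1

comp <- !x         # logical negation: the element-wise complement
class(comp)        # "logical"
comp               # FALSE  TRUE FALSE
```

So if a complement of a condition is what you are after, ! (or writing the condition with !=) is probably the operator to reach for.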
Upvotes: 2
Reputation: 1709
I think the complication arises because you're using the values 0 and 1, which are also the numerical equivalents of FALSE and TRUE. So I will try to explain what went wrong with your code above by using a case where df has only two rows, and different numbers:
df<- data.frame(column = matrix(c(2,3), nrow=2))
> df
  column
1      2
2      3
Calling (df$column == 3) returns two logical values:
> df$column == 3
[1] FALSE TRUE
Because TRUE is 1 and FALSE is 0, calling df[-(df$column == 3), ] is the same as calling df[-c(0, 1), ]: in both cases you are removing the 1st row (there is no 0-indexing in R, so the 0 is ignored).
> df[-(df$column == 3), ]
[1] 3
> df[-c(0, 1), ]
[1] 3
The reverse is true when you call df[(df$column != 3),]: the index is the logical vector c(TRUE, FALSE), so this time you are retaining the 1st row.
> df[(df$column != 3),]
[1] 2
> df[c(1,0), ]
[1] 2
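The two-row example hides part of the behaviour, so here is a sketch with a hypothetical four-row data frame (the name d and its values are assumptions made up for illustration). It shows that [ keeps the rows where a logical index is TRUE, whereas the index -(d$column == 0) only ever contains 0 and -1, so it drops the first row and nothing else:

```r
# Hypothetical data, not from the question
d <- data.frame(column = c(5, 0, 7, 0))

# Logical index: keep the rows where the condition is TRUE
keep <- d[d$column != 0, , drop = FALSE]
keep$column                      # 5 7

# Negated logical: becomes c(0, -1, 0, -1)
idx <- -(d$column == 0)
# Zeros are ignored and -1 means "drop row 1", so only row 1 is removed
wrong <- d[idx, , drop = FALSE]
wrong$column                     # 0 7 0
```

The drop = FALSE argument is only there to keep the result a data frame; with a single column, the default would simplify it to a plain vector, as in the outputs above.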
What you were trying to do is remove the row for which df$column == 3, but in order to do that you need to know what row number it is, so you need its index. That's when you call the which function. So you would do:
df2 <- df[-which(df$column == 3), ]
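One caveat with the which approach, sketched below with made-up data (d is a hypothetical name): if no row matches, which() returns an empty vector, and a negated empty index selects zero rows rather than all of them. Indexing with the negated condition avoids that:

```r
d <- data.frame(column = c(2, 3))   # hypothetical data

# A row matches: both forms drop it
a <- d[-which(d$column == 3), , drop = FALSE]
b <- d[!(d$column == 3), , drop = FALSE]
nrow(a)   # 1
nrow(b)   # 1

# No row matches: which() gives integer(0), and d[integer(0), ] is empty
c0 <- d[-which(d$column == 99), , drop = FALSE]
c1 <- d[!(d$column == 99), , drop = FALSE]
nrow(c0)  # 0
nrow(c1)  # 2
```

For that reason the logical form, as in the line that worked in the question, may be the safer default when the condition might match nothing.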
Other than that, your understanding of how - is used in R is correct, and I think R decides how to use it based on the context.
Disclaimer: I am sorry for the long, and maybe pedantic, answer; I just did not want to assume anything.
Upvotes: 2