Reputation: 72769
I'm trying to get a handle on the ubiquitous which() function. Until I started reading questions/answers on SO I never found the need for it. And I still don't.
As I understand it, which() takes a Boolean vector and returns a weakly shorter vector containing the indices of the elements which were true:
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> x <- seq(10)
> tf <- (x == 6 | x == 8)
> tf
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
> w <- which(tf)
> w
[1] 6 8
So why would I ever use which() instead of just using the Boolean vector directly? I could maybe see some memory issues with huge vectors, since length(w) << length(tf), but that's hardly compelling. And there are some options in the help file which don't add much to my understanding of possible uses of this function. The examples in the help file aren't of much help either.
Edit for clarity: I understand that which() returns the indices. My question is about two things: 1) why would you ever need to use the indices instead of just using the Boolean selector vector? and 2) what interesting behaviors of which() might make it preferable to just using a vectorized Boolean comparison?
Upvotes: 43
Views: 71705
Reputation: 11946
Surprised no one has answered this: how about memory efficiency?
If you have a long vector with very sparse TRUEs, then keeping track of only the indices of the TRUE values will probably be much more compact.
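A rough sketch of that size difference, assuming a made-up vector of one million elements with only three TRUE values (the names tf and idx are just for illustration):

```r
# Hypothetical sparse logical vector: one million elements, three TRUE
n <- 1e6
tf <- logical(n)
tf[c(10, 500000, 999999)] <- TRUE

# which() keeps only the positions of the TRUE elements
idx <- which(tf)

object.size(tf)   # size grows with the full length of the vector
object.size(idx)  # size grows only with the number of TRUE values
```

The saving only shows up when the TRUE values really are sparse; for a dense selector the integer indices are no smaller.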
Upvotes: 7
Reputation: 9380
I use it quite often in data exploration. For example, if I have a dataset of kids' data and see from summary() that the max age is 23 (when it should be at most 18), I might go:
sum(dat$age>18)
If that was 67, and I wanted to look closer I might use:
dat[which(dat$age>18)[1:10], ]
It's also useful if you're making a presentation and want to pull out a snippet of data to demonstrate a certain oddity or whatnot.
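Here is a small self-contained sketch of that workflow. The data frame dat and its values are invented for illustration. One thing to watch: indexing with [1:10] pads the result with NA when fewer than ten rows match, which in turn produces all-NA rows, so this version uses head() instead.

```r
# Made-up data: ages of kids, with a few suspicious entries and one NA
dat <- data.frame(id = 1:8, age = c(12, 19, NA, 23, 17, 18, 21, 9))

sum(dat$age > 18, na.rm = TRUE)   # how many rows look wrong

# Positions of the offending rows; which() silently drops the NA comparison
over <- which(dat$age > 18)

# Look at (up to) the first ten of them
shown <- dat[head(over, 10), ]
shown
```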
Upvotes: 4
Reputation: 5700
Okay, here is something where it proved useful last night:
In a given vector of values what is the index of the 3rd non-NA value?
> x <- c(1,NA,2,NA,3)
> which(!is.na(x))[3]
[1] 5
A little different from DWin's use, although I'd say his is compelling too!
Upvotes: 26
Reputation: 174898
The title of the man page ?which provides a motivation. The title is:
Which indices are TRUE?
Which I interpret as being the function one might use if you want to know which elements of a logical vector are TRUE. This is inherently different to just using the logical vector itself. That would select the elements that are TRUE, not tell you which of them were TRUE.
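To make the distinction concrete, here is a minimal sketch with an invented named vector (scores and big are hypothetical names): subsetting with the logical vector gives the values, while which() gives their positions.

```r
# Hypothetical named vector
scores <- c(a = 10, b = 25, c = 7, d = 40)
big <- scores > 20

vals <- scores[big]   # the selected values: 25 and 40
pos  <- which(big)    # their positions: 2 and 4 (names are carried along)

vals
pos
```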
Common use cases were to get the position of the maximum or minimum values in a vector:
> set.seed(2)
> x <- runif(10)
> which(x == max(x))
[1] 5
> which(x == min(x))
[1] 7
Those were so commonly used that which.max() and which.min() were created:
> which.max(x)
[1] 5
> which.min(x)
[1] 7
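However, note that the specific forms are not exact replacements for the generic form: which.min() and which.max() return only the index of the first minimum or maximum, whereas the generic form returns all of them. See ?which.min for details. One example is below: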
> x <- c(4,1,1)
> which.min(x)
[1] 2
> which(x==min(x))
[1] 2 3
Upvotes: 20
Reputation: 28652
which() can be useful (it saves both computer and human resources) when, for example, you have to filter the rows of a data frame/matrix by a given variable/column and update other variables/columns based on that. Example:
df <- mtcars
Instead of:
df$gear[df$hp > 150] <- mean(df$gear[df$hp > 150])
You could do:
p <- which(df$hp > 150)
df$gear[p] <- mean(df$gear[p])
Another case is when you have to filter an already filtered set of elements in a way that cannot be done with a simple & or |, e.g. when you have to update some parts of a data frame based on other data tables. Here you need to store the indices of the filtered elements, at least temporarily.
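One way to picture "filtering a filtered set" is picking the top few rows within a subset. This is only a sketch on mtcars; the choice of columns and the names p and best are my own assumptions, not part of the case described above.

```r
df <- mtcars

# First filter: positions of the cars with more than 150 hp
p <- which(df$hp > 150)

# Second step, which a plain & or | cannot express:
# among those cars only, take the three with the highest mpg
best <- p[order(df$mpg[p], decreasing = TRUE)[1:3]]

df[best, c("hp", "mpg")]
```

Because p holds positions in the original data frame, best can be used to read from or assign into df directly.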
Another use that comes to mind is when you have to loop through part of a data frame/matrix, or do other kinds of transformations that require knowing the indices of several cases. Example:
> urban <- which(USArrests$UrbanPop > 80)
> USArrests[urban, ] - USArrests[urban-1, ]
              Murder Assault UrbanPop  Rape
California       0.2      86       41  21.1
Hawaii         -12.1    -165       23  -5.6
Illinois         7.8     129       29   9.8
Massachusetts   -6.9    -151       18 -11.5
Nevada           7.9     150       19  29.5
New Jersey       5.3     102       33   9.3
New York        -0.3     -31       16  -6.0
Rhode Island    -2.9      68       15  -6.6
Sorry for the dummy examples. I know it does not make much sense to compare the most urbanized US states with the states just before them in the alphabet, but I hope it illustrates the point :)
Checking out which.min and which.max gives some clues too, as they save you a lot of typing. Example:
> row.names(mtcars)[which.max(mtcars$hp)]
[1] "Maserati Bora"
Upvotes: 12
Reputation: 263451
Two very compelling reasons not to forget which():
1) When you use "[" to extract from a data frame, any calculation in the row position that results in NA will get a junk row returned. Using which() removes the NAs. You can also use subset or %in%, which do not create the same problem.
> dfrm <- data.frame( a=sample(c(1:3, NA), 20, replace=TRUE), b=1:20)
> dfrm[dfrm$a >0, ]
      a  b
1     1  1
2     3  2
NA   NA NA
NA.1 NA NA
NA.2 NA NA
6     1  6
NA.3 NA NA
8     3  8
# Snipped remaining rows
2) When you need the array indices (arr.ind = TRUE).
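Since the example in point 1 uses sample() without a seed, here is a small deterministic sketch of the same effect. The data frame small is invented for illustration.

```r
# Made-up data frame with two NAs in column a
small <- data.frame(a = c(1, NA, 3, NA, 2), b = 1:5)

# Logical indexing: each NA in the condition yields an all-NA "junk" row
with_junk <- small[small$a > 0, ]

# which() drops the NA comparisons, so only real matches are returned
clean <- small[which(small$a > 0), ]

nrow(with_junk)  # 5 rows, two of them all NA
nrow(clean)      # 3 rows
```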
Upvotes: 17
Reputation: 72769
Well, I found one possible reason. At first I thought it might be the useNames option, but it turns out that simple Boolean selection does that too.
However, if your object of interest is a matrix, you can use the arr.ind option to return the result as (row, column) ordered pairs:
> x <- matrix(seq(10),ncol=2)
> x
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> which((x == 6 | x == 8),arr.ind=TRUE)
     row col
[1,]   1   2
[2,]   3   2
> which((x == 6 | x == 8))
[1] 6 8
That's a handy trick to know about, but hardly seems to justify its constant use.
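As a follow-up sketch (the name pos is mine): the two-column result of arr.ind = TRUE can be fed straight back into "[" to pull out the matching cells, or split into its row and column parts.

```r
x <- matrix(seq(10), ncol = 2)

# One (row, col) pair per matching cell
pos <- which(x == 6 | x == 8, arr.ind = TRUE)

# A two-column index matrix selects individual cells
cells <- x[pos]

cells            # 6 8
pos[, "row"]     # rows of the matches: 1 3
pos[, "col"]     # columns of the matches: 2 2
```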
Upvotes: 11