Reputation: 1316
I have a rather large data frame with a factor that has a lot of levels (more than 4,000). I have another column in the same data frame that I'm using as a reference, and what I'd like to find is a subset of the levels whenever this reference column is NA.
The first step I'm using is
subsetrows <- which(is.na(mydata$reference))
but after that I'm stuck. I want something like
levels(mydata[subsetrows,mydata$factor])
but unfortunately, this command shows me all the levels and not just the ones existing in subsetrows. I suppose I could create a new vector outside of my data frame of only my subset rows and then drop any unused levels, but is there any easier/cleaner way to do this, possibly without copying my data outside the data frame?
As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
Upvotes: 2
Views: 2852
Reputation: 1316
I modified a suggestion in the comments by Marat to use the function unique, which seems to return the correct levels.
Solution:
subsetrows <- which(is.na(mydata$reference))
unique(as.character(mydata$factor[subsetrows]))
While I like learning new packages and functions, this solution seems better at this point since it's more compact and easier for me to understand if I need to revisit this code at some distant point in the future.
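For anyone revisiting this later, here is a minimal sketch on toy data showing why levels() on the subset still lists everything, next to two base R ways around it. The toy data frame and the names present_values and present_levels are made up for illustration; droplevels() is the "drop unused levels" route mentioned in the question, applied to the subset vector rather than to a copy of the data frame.

```r
# Toy data (hypothetical): a factor with levels A..Z and a reference column
mydata <- data.frame(
  factor    = factor(c("P", "R", "Y", "A", "P"), levels = LETTERS),
  reference = c(NA, NA, NA, "x", NA)
)

subsetrows <- which(is.na(mydata$reference))

# Subsetting a factor keeps its full level set, so this still counts all 26 levels
n_levels_kept <- length(levels(mydata$factor[subsetrows]))

# Option 1: values actually present, in order of first appearance
present_values <- unique(as.character(mydata$factor[subsetrows]))

# Option 2: drop the unused levels; the result follows the original level order
present_levels <- levels(droplevels(mydata$factor[subsetrows]))

print(present_values)
print(present_levels)
```

The two options return the same set of values, but the order can differ: unique() follows the order of appearance in the rows, while levels(droplevels()) follows the order of the factor's levels.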
Upvotes: 0
Reputation: 9123
You can certainly accomplish this with base functions. But my personal preference is to use dplyr with chained operations such as this:
library(dplyr)
d %>%
  filter(is.na(ref)) %>%
  select(field) %>%
  distinct()
Data:
d <- data.frame(
  field = c("A", "B", "C", "A", "B", "C"),
  ref = c(NA, "a", "b", NA, "c", NA)
)
Upvotes: 2