socialscientist

Reputation: 4242

Subsetting efficiently on multiple columns and rows

I am trying to subset my data by dropping rows based on the values of certain variables. Suppose I have a data frame df with many columns and rows. I want to keep only the rows where both G1 and G9 take on values of 1, 2, or 3, and drop everything else. In other words, I want to subset on the same set of values across multiple variables.

I am trying to do this in few lines of code and in a way that allows quick changes to the variables or values I want to use. For example, assuming I start with data frame df and want to end with newdf, which excludes any observation where G1 or G9 does not take on a value of 1, 2, or 3:

# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]

# Better approach (requires manually changing variable names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]

If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.

vals <- c(1,2,3)
vars <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals), ]

It is my understanding I want to use apply() but I am not sure how.

Upvotes: 0

Views: 365

Answers (2)

black_sheep07

Reputation: 2368

Use data.table

First, melt your data

library(data.table)

DT <- melt(as.data.table(df))  # melt() here needs a data.table, so convert df first

Then split into lists

DTLists <- split(DT, by = "variable")  # one list element per melted column

Now you can operate on each list element using lapply

DTresult <- lapply(DTLists, function(x) {
                      ...
                      })
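
The body of the function above is left open. As a separate sketch, and not the melt/split route described in this answer, the same filter can also be written directly on a data.table without reshaping. The toy table and the names vars, keep, and newDT are assumptions made up for illustration:

```r
library(data.table)

# Toy data for illustration only; G1, G5, G9 stand in for the real columns
DT <- data.table(
  G1 = c(1, 2, 4, 3),
  G5 = c("a", "b", "c", "d"),
  G9 = c(3, 5, 1, 2)
)

vals <- c(1, 2, 3)      # values to keep
vars <- c("G1", "G9")   # columns to check (hypothetical name for this vector)

# For each column listed in .SDcols, test membership in vals,
# then combine the per-column results element-wise with AND
keep <- DT[, Reduce(`&`, lapply(.SD, `%in%`, vals)), .SDcols = vars]

# A logical vector in i selects the matching rows
newDT <- DT[keep]
```

Changing the filter then only means editing vals or vars.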

Upvotes: 1

Raad

Reputation: 2715

You do not need which() with %in%: it already returns a logical vector, which can index the rows directly. How about the below (with vals defined as in your question):

keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
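
To extend this to many variables without writing one condition per column, here is a minimal sketch. It assumes the column names are collected in a character vector (called vars here, a name chosen for illustration) and uses a small made-up df. It builds one logical vector per column with lapply() and combines them with Reduce():

```r
# Toy data for illustration only; G1, G5, G9 stand in for the real columns
df <- data.frame(
  G1 = c(1, 2, 4, 3, NA),
  G5 = c("a", "b", "c", "d", "e"),
  G9 = c(3, 5, 1, 2, 1)
)

vals <- c(1, 2, 3)      # values to keep
vars <- c("G1", "G9")   # columns to check; edit this vector to change the filter

# One logical vector per column: TRUE where the value is in vals
checks <- lapply(df[vars], function(col) col %in% vals)

# Combine element-wise with AND: TRUE only if every checked column passes
keepies <- Reduce(`&`, checks)

newdf <- df[keepies, ]
```

Because %in% returns FALSE rather than NA for missing values, rows with NA in any checked column are dropped here instead of showing up as all-NA rows.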

Upvotes: 1
