Reputation: 167
Load library and sample data:
library(MASS)
View(Cars93)
Cars93$ID=1:93
Now I want to subset Cars93 so that the new data frames (sub0l and sub0h) keep all IDs and all columns, but retain only the top 10% of values (for sub0h) and the lowest 10% of values (for sub0l) in columns 17:25. The remaining values (the 11th to 100th percentiles for sub0l and the 0 to 90th percentiles for sub0h) can be changed to NA.
Here is my attempt to create two data frames with the top 10% or lowest 10% of values from columns 17:25:
sub0l <- do.call(rbind,by (Cars93,Cars93$ID,FUN= function(x)
subset(Cars93, (Cars93[,17:25] <= quantile(Cars93[,17:25], probs= .10)))))
sub0h <- do.call(rbind,by (Cars93,Cars93$ID,FUN= function(x)
subset(Cars93, (Cars93[,17:25] >= quantile(Cars93[,17:25], probs= .91)))))
I get an error while subsetting the top and lowest 10% of the columns:
Error in `[.data.frame`(Cars93, ,17:25) : undefined columns selected
Called from: `[.data.frame`(Cars93, ,17:25)
Any better alternative?
Upvotes: 1
Views: 1775
Reputation: 38500
I think the following returns what you are looking for:
sub0l <- cbind(Cars93[,1:16], sapply(Cars93[,17:25],
    function(i) ifelse(i > quantile(i, probs=0.1, na.rm=TRUE) | is.na(i), NA, i)))
sub0h <- cbind(Cars93[,1:16], sapply(Cars93[,17:25],
    function(i) ifelse(i < quantile(i, probs=0.91, na.rm=TRUE) | is.na(i), NA, i)))
The sapply function loops over each column of the data frame passed to it (here, columns 17:25). On each pass, the anonymous function receives the column as a vector through the argument i, computes the cutoff with quantile, and hands the comparison to ifelse. ifelse then checks each element of i against the test. If the element passes the test (it lies on the wrong side of the cutoff or is missing), it is assigned NA; if it fails, its original value is returned. This process works well for numeric variables.
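To make the masking step concrete, here is a minimal sketch on a single made-up vector rather than on Cars93. The vector x and the names cutoff and low are illustrative assumptions, not part of the answer above.

```r
# Toy vector standing in for one numeric column (values are made up)
x <- c(5, 1, 9, NA, 3, 7, 2, 8, 6, 4, 10)

# 10th percentile of the non-missing values
cutoff <- quantile(x, probs = 0.1, na.rm = TRUE)

# Elements above the cutoff, and elements that are already missing,
# become NA; elements at or below the cutoff keep their value
low <- ifelse(x > cutoff | is.na(x), NA, x)

print(cutoff)
print(low)
```

Note that the `| is.na(i)` part of the test matters: without it, a missing element would make the test itself NA rather than TRUE.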
If some of the variables are not numeric, you can add a check inside the sapply calls, as below:
sub0l <- cbind(Cars93[,1:16],
               sapply(Cars93[,17:25],
                      function(i) {
                        if(is.numeric(i)) {
                          ifelse(i > quantile(i, probs=0.1, na.rm=TRUE) | is.na(i), NA, i)
                        }
                        else i
                      }))
sub0h <- cbind(Cars93[,1:16],
               sapply(Cars93[,17:25],
                      function(i) {
                        if(is.numeric(i)) {
                          ifelse(i < quantile(i, probs=0.91, na.rm=TRUE) | is.na(i), NA, i)
                        }
                        else i
                      }))
Before beginning the operation described above, the anonymous function checks whether the vector i is numeric (in R, is.numeric is TRUE for vectors of type double or integer, but not for factors; see ?typeof for a discussion of the core element types in R). If this test fails, the vector is returned unchanged by else i. If the test passes, the process described above begins.
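One caveat worth hedging: cbind(Cars93[,1:16], ...) keeps only columns 1 to 25, so any later columns (including the ID column added in the question) are dropped, and sapply simplifies its result to a matrix, which may coerce mixed column types to a common type. A possible alternative, sketched below under those assumptions, assigns the modified columns back into a copy of the data frame with lapply. The helper name keep_tail and its arguments are hypothetical, and the 0.91 threshold simply mirrors the code above.

```r
library(MASS)  # provides the Cars93 data set

# Hypothetical helper: keep one tail of a numeric vector, set the rest to NA
keep_tail <- function(v, p, lower = TRUE) {
  if (!is.numeric(v)) return(v)             # leave factors/characters untouched
  cutoff <- quantile(v, probs = p, na.rm = TRUE)
  drop <- if (lower) v > cutoff else v < cutoff
  v[drop | is.na(v)] <- NA                  # NA | TRUE is TRUE, so the index has no NAs
  v
}

cols <- 17:25

# Lowest 10%: values above the 10th percentile become NA
sub0l <- Cars93
sub0l[cols] <- lapply(Cars93[cols], keep_tail, p = 0.10, lower = TRUE)

# Top tail: values below the 0.91 quantile become NA
sub0h <- Cars93
sub0h[cols] <- lapply(Cars93[cols], keep_tail, p = 0.91, lower = FALSE)
```

Because only the selected columns are overwritten, every other column keeps its original type and position, and the row count is unchanged.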
Upvotes: 2