Reputation: 123
I have a data frame of survey responses (rows = participants, columns = question responses). Participants responded to 50 questions on a 5-point Likert scale. I would like to remove participants who answered 5 to all 50 questions, as their responses have zero variance and are likely to bias my results.
I have seen the nearZeroVar() function, but was wondering if there's a way to do this in base R?
Many thanks,
R
Upvotes: 2
Views: 4609
Reputation: 47330
Stealing @AshOfFire's data, with a small modification, since you say your columns contain only answers and no participant IDs:
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))
survey[!apply(survey == survey[[1]], 1, all), ]
#    q1 q2 q3
# 1   1  1  3
# 4   5  5  4
# 6   1  1  5
# 10  2  3  5
The equality test compares every column with the first one and builds a logical matrix; with apply we then keep the rows that aren't all TRUE.
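To make the one-liner easier to follow, here is a small sketch that splits it into steps on the same `survey` data. The intermediate names `same_as_first` and `straight_lined` are only illustrative and are not part of the original answer.

```r
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))

# Compare every column with the first one; the result is a logical matrix
same_as_first <- survey == survey[[1]]

# A row is "straight-lined" when every answer equals the first answer
straight_lined <- apply(same_as_first, 1, all)

# Keep only the participants whose answers vary
survey[!straight_lined, ]
```

Note that this drops every constant row (all 2s, all 3s, and so on), not only the rows that are all 5s.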
Upvotes: 0
Reputation: 33518
# Dummy data:
df <- data.frame(
  matrix(
    sample(1:5, 100000, replace = TRUE),
    ncol = 5
  )
)
names(df) <- paste0("likert", 1:5)
df$id <- 1:nrow(df)
head(df)
  likert1 likert2 likert3 likert4 likert5 id
1       1       2       4       4       5  1
2       5       4       2       2       1  2
3       2       1       2       1       5  3
4       5       1       3       3       2  4
5       4       3       3       5       1  5
6       1       3       3       2       3  6
dim(df)
[1] 20000     6
# Clean out rows where all likert values are 5
df <- df[rowSums(df[grepl("likert", names(df))] == 5) != 5, ]
nrow(df)
[1] 19995
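The filter above hard-codes the number of questions (5). As a hedged sketch, the same idea can be written so that it adapts to however many question columns there are (50 in the original question). The seed and the names `likert_cols`, `n_questions`, `all_fives` and `cleaned` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(matrix(sample(1:5, 100000, replace = TRUE), ncol = 5))
names(df) <- paste0("likert", 1:5)
df$id <- seq_len(nrow(df))

likert_cols <- grepl("^likert", names(df))  # logical mask of the question columns
n_questions <- sum(likert_cols)             # number of question columns found

# TRUE for participants who answered 5 to every question
all_fives <- rowSums(df[likert_cols] == 5) == n_questions

cleaned <- df[!all_fives, ]
```

This assumes complete responses: if a question column contains NA, rowSums() returns NA for that row unless na.rm = TRUE is set, and the subset would then need extra handling.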
Upvotes: 0
Reputation: 6813
If you had this dataframe:
df <- data.frame(col1 = rep(1, 10),
                 col2 = 1:10,
                 col3 = rep(1:2, 5))
You could calculate the variance of each column and keep only those columns where the variance is not 0, or is greater than or equal to some threshold, which is close to what nearZeroVar() would do:
df[, sapply(df, var) != 0]
df[, sapply(df, var) >= 0.3]
If you wanted to exclude rows, you could do something similar, but loop through the rows instead and then subset:
df[apply(df, 1, var) != 0, ]
df[apply(df, 1, var) >= 0.3, ]
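For survey data, the row-wise variant should only see the question columns: an ID column would add variance to every row and hide the straight-liners. A minimal sketch, assuming a hypothetical `participants` ID column and using the illustrative names `question_cols` and `row_var`:

```r
survey <- data.frame(participants = 1:10,
                     q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))

# Work on the question columns only, leaving the ID column out
question_cols <- setdiff(names(survey), "participants")

# Variance of each participant's answers; 0 means every answer was identical
row_var <- apply(survey[question_cols], 1, var)

# Keep participants whose answers vary, together with their IDs
survey[row_var != 0, ]
```

As with the column-wise version, this removes any constant row, whatever the repeated value is.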
Upvotes: 3
Reputation: 676
Assuming you have data like this.
survey <- data.frame(participants = c(1:10),
                     q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))
You can do the following.
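# Logical indexing also works when no row is all 5s
# (negative indexing with an empty which() result would return zero rows)
keep <- !apply(survey[, -1], 1, function(x) all(x == 5))
survey[keep, ]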
This will remove rows where all values equal 5.
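One thing to watch when removing rows by negative index: if no participant answered 5 throughout, which() returns an empty vector, and negative indexing with an empty vector selects no rows at all. The sketch below uses made-up data with no all-5 rows to show the difference; `is_all_five` is an illustrative name.

```r
# Hypothetical data in which nobody answered 5 to every question
survey <- data.frame(participants = 1:3,
                     q1 = c(1, 2, 3),
                     q2 = c(1, 2, 4))

is_all_five <- apply(survey[, -1], 1, function(x) all(x == 5))

idx <- which(is_all_five)      # integer(0): there is nothing to remove
nrow(survey[-idx, ])           # 0: an empty negative index drops every row

nrow(survey[!is_all_five, ])   # 3: logical indexing keeps all participants
```

Subsetting with the logical vector directly avoids this edge case.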
Upvotes: 1