RMAkh

Reputation: 123

Remove rows with zero-variance in R

I have a dataframe of survey responses (rows = participants, columns = question responses). Participants responded to 50 questions on a 5-point Likert scale. I would like to remove participants who answered 5 on all 50 questions, as their responses have zero variance and are likely to bias my results.

I have seen the nearZeroVar() function, but was wondering if there's a way to do this in base R?

Many thanks,

R

Upvotes: 2

Views: 4609

Answers (4)

moodymudskipper

Reputation: 47330

Stealing @AshOfFire's data, with a small modification, since you say your columns hold only answers and no participant ID:

survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2), 
                     q2 = c(1,2,5,5,5,1,2,3,4,3), 
                     q3 = c(3,2,5,4,5,5,2,3,4,5))

survey[!apply(survey==survey[[1]],1,all),]

#    q1 q2 q3
# 1   1  1  3
# 4   5  5  4
# 6   1  1  5
# 10  2  3  5

The equality test compares every column with the first one and returns a logical matrix; with apply we then keep the rows that aren't all TRUE, i.e. the rows whose answers are not all identical.
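To make the intermediate step visible, here is a small sketch on the same `survey` data. The names `same_as_first`, `constant_row` and `constant_row2` are introduced here only for illustration, and the `rowSums()` variant is an alternative added here, not part of the one-liner above.

```r
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))

# Compare every column with the first one; the result is a logical matrix
same_as_first <- survey == survey[[1]]

# A row is constant when all of its entries match the first column
constant_row <- apply(same_as_first, 1, all)

# Same test without apply(): count the matches in each row
constant_row2 <- rowSums(same_as_first) == ncol(survey)

# Keep the rows that are not constant (rows 1, 4, 6 and 10)
survey[!constant_row, ]
```

Note that this drops every constant row, whatever the repeated value is, not only the rows answered with 5 throughout.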

Upvotes: 0

s_baldur

Reputation: 33518

# Dummy data (random, so the values shown below vary between runs):
df <- data.frame(
  matrix(
    sample(1:5, 100000, replace = TRUE), 
    ncol = 5
  )
)
names(df) <- paste0("likert", 1:5)
df$id <- 1:nrow(df)
head(df)
#   likert1 likert2 likert3 likert4 likert5 id
# 1       1       2       4       4       5  1
# 2       5       4       2       2       1  2
# 3       2       1       2       1       5  3
# 4       5       1       3       3       2  4
# 5       4       3       3       5       1  5
# 6       1       3       3       2       3  6
dim(df)
# [1] 20000     6

# Clean out rows where all likert values are 5
df <- df[rowSums(df[grepl("likert", names(df))] == 5) != 5, ]
nrow(df)
# [1] 19995

Upvotes: 0

clemens

Reputation: 6813

If you had this dataframe:

df <- data.frame(col1 = rep(1, 10),
                 col2 = 1:10,
                 col3 = rep(1:2, 5))

You could calculate the variance of each column and keep only the columns whose variance is not 0, or is at least some threshold. That is roughly what nearZeroVar() would do:

df[, sapply(df, var) != 0]
df[, sapply(df, var) >= 0.3]

If you wanted to exclude rows, you could do something similar, but loop through the rows instead and then subset:

df[apply(df, 1, var) != 0, ]
df[apply(df, 1, var) >= 0.3, ]
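On a large data frame, `apply(df, 1, var)` calls `var()` once per row. A minimal sketch of a vectorised alternative, assuming every column is numeric and there are no missing values; the names `m`, `row_var` and `keep` are made up for this example:

```r
df <- data.frame(col1 = rep(1, 10),
                 col2 = 1:10,
                 col3 = rep(1:2, 5))

m <- as.matrix(df)

# Sample variance of each row: squared deviations from the row mean,
# summed and divided by (number of columns - 1)
row_var <- rowSums((m - rowMeans(m))^2) / (ncol(m) - 1)

# Keep rows whose variance is greater than zero
keep <- row_var > 0
df[keep, ]
```

Here only the first row (1, 1, 1) is constant, so it is the only one dropped. With non-integer data the exact comparison against zero may be too strict, in which case a small tolerance is safer.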

Upvotes: 3

AshOfFire

Reputation: 676

Assuming you have data like this.

survey <- data.frame(participants = c(1:10),
                     q1 = c(1,2,5,5,5,1,2,3,4,2), 
                     q2 = c(1,2,5,5,5,1,2,3,4,3), 
                     q3 = c(3,2,5,4,5,5,2,3,4,5))

You can do the following.

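# Use a logical vector rather than negative indices from which(),
# so the subset still works when no row matches
all_five <- apply(survey[, -1], 1, function(x) all(x == 5))
survey[!all_five, ]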

This will remove rows where all values equal 5.
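One caveat when rows are located with `which()` and then dropped by negative indexing: if no row matches, the index is empty and the subset comes back with no rows at all. The sketch below shows this on the same `survey` data by looking for participants who answered 1 throughout (there are none); `all_one` is a name made up for this example.

```r
survey <- data.frame(participants = 1:10,
                     q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))

# Logical vector: TRUE where every answer in the row equals 1
all_one <- apply(survey[, -1], 1, function(x) all(x == 1))

# Nobody answered 1 on every question, so which() returns integer(0)
idx <- which(all_one)
length(idx)               # 0

# Negative indexing with an empty vector selects nothing
nrow(survey[-idx, ])      # 0

# Negating the logical vector keeps all rows, as intended
nrow(survey[!all_one, ])  # 10
```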

Upvotes: 1
