Sarah Ellen
Sarah Ellen

Reputation: 5

Ordering dataframe based on same factor values across columns

I have a dataframe with the same factor values across multiple columns and I would like to sort/subset my data based on the number of values in a column per row.

df <- data.frame(a = factor(c("yes", "yes", "no", "maybe"), 
levels = c("yes", "no", "maybe")), b = factor(c("maybe", "yes", "yes", "no"), 
levels = c("yes", "no", "maybe")), c = factor(c("maybe", "yes", "yes", "no"), 
levels = c("yes", "no", "maybe")), d = c(1,2,3,4))

df
      a     b     c d
1   yes maybe maybe 1
2   yes   yes   yes 2
3    no   yes   yes 3
4 maybe    no    no 4

I would like to order/subset the data based on the number of times "yes" occurs over all the columns for each row. So subset all rows where "yes" occurs 2 or more times (df2) and then (less importantly) sort based on this where rows with the most amount of "yes" values are at the top. It's not important if the original row numbers are retained.

df2
      a     b     c d
2   yes   yes   yes 2
3    no   yes   yes 3

df
      a     b     c d
2   yes   yes   yes 2
3    no   yes   yes 3
1   yes maybe maybe 1
4 maybe    no    no 4

I thought about using the order() function:

df[order(df$a,df$b,df$c), ]

but this doesn't return what I want. I think I need to use lapply() but I'm not really sure what function to use.

Upvotes: 0

Views: 64

Answers (1)

tblznbits
tblznbits

Reputation: 6778

We can use rowSums for this.

df <- data.frame(a = factor(c("yes", "yes", "no", "maybe"), 
levels = c("yes", "no", "maybe")), b = factor(c("maybe", "yes", "yes", "no"), 
levels = c("yes", "no", "maybe")), c = factor(c("maybe", "yes", "yes", "no"), 
levels = c("yes", "no", "maybe")), d = c(1,2,3,4))

df2 <- df[rowSums(df == "yes") >= 2, ]

df2
#     a   b   c d Count
# 2 yes yes yes 2     3
# 3  no yes yes 3     2

This takes care of the filtering method. But if we want to also sort by those with the most "yes" values, we can first set it as a column in the data, then filter and sort, then remove the column

df$Count <- rowSums(df == "yes")
df <- df[df$Count >= 2, ]
df <- df[order(df$Count, decreasing = TRUE), ]
df <- subset(df, select = -c(Count))
df
#     a   b   c d
# 2 yes yes yes 2
# 3  no yes yes 3

Upvotes: 4

Related Questions