Reputation: 5
I have a dataframe with the same factor values across multiple columns and I would like to sort/subset my data based on the number of values in a column per row.
df <- data.frame(a = factor(c("yes", "yes", "no", "maybe"),
levels = c("yes", "no", "maybe")), b = factor(c("maybe", "yes", "yes", "no"),
levels = c("yes", "no", "maybe")), c = factor(c("maybe", "yes", "yes", "no"),
levels = c("yes", "no", "maybe")), d = c(1,2,3,4))
df
a b c d
1 yes maybe maybe 1
2 yes yes yes 2
3 no yes yes 3
4 maybe no no 4
I would like to order/subset the data based on the number of times "yes" occurs over all the columns for each row. So subset all rows where "yes" occurs 2 or more times (df2
) and then (less importantly) sort based on this where rows with the most amount of "yes" values are at the top. It's not important if the original row numbers are retained.
df2
a b c d
2 yes yes yes 2
3 no yes yes 3
df
a b c d
2 yes yes yes 2
3 no yes yes 3
1 yes maybe maybe 1
4 maybe no no 4
I thought about using the order()
function:
df[order(df$a,df$b,df$c), ]
but this doesn't return what I want. I think I need to use lapply()
but I'm not really sure what function to use.
Upvotes: 0
Views: 64
Reputation: 6778
We can use rowSums
for this.
df <- data.frame(a = factor(c("yes", "yes", "no", "maybe"),
levels = c("yes", "no", "maybe")), b = factor(c("maybe", "yes", "yes", "no"),
levels = c("yes", "no", "maybe")), c = factor(c("maybe", "yes", "yes", "no"),
levels = c("yes", "no", "maybe")), d = c(1,2,3,4))
df2 <- df[rowSums(df == "yes") >= 2, ]
df2
# a b c d Count
# 2 yes yes yes 2 3
# 3 no yes yes 3 2
This takes care of the filtering method. But if we want to also sort by those with the most "yes" values, we can first set it as a column in the data, then filter and sort, then remove the column
df$Count <- rowSums(df == "yes")
df <- df[df$Count >= 2, ]
df <- df[order(df$Count, decreasing = TRUE), ]
df <- subset(df, select = -c(Count))
df
# a b c d
# 2 yes yes yes 2
# 3 no yes yes 3
Upvotes: 4