Reputation: 16856
In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values).
One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36).
Data
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
I know that I can use pivot_longer
to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with tidyverse
, rather than having to pivot back and forth.
What I Tried
library(tidyverse)
df %>%
pivot_longer(-c(Student, ID)) %>%
group_by(Student, ID) %>%
arrange(desc(value), .by_group = TRUE) %>%
slice_head(n = 2) %>%
pivot_wider(names_from = name, values_from = value) %>%
ungroup() %>%
mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))
I also know that if I wanted to sum all the columns on each row, then I could use rowSums
, as I made use of above. However, I am unsure how to do rowSums
of just the top 2 values in the 4 quiz columns.
Expected Output
# A tibble: 10 × 7
Student ID Quiz2 Quiz3 Quiz1 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 42 36 NA NA 78
2 Charlotte 61755 NA NA 41 39 80
3 Chloe 98432 49 NA 42 NA 91
4 Grant 34096 48 44 NA NA 92
5 James 87311 NA 36 NA 43 79
6 Judy 19487 NA 42 NA 41 83
7 Justin 94029 NA 37 NA 41 78
8 Katie 55323 NA 32 NA 46 78
9 Olivia 94839 NA 43 NA 40 83
10 Timothy 38209 38 38 NA NA 76
Upvotes: 9
Views: 1333
Reputation: 25323
Yet another solution, based on tidyverse
:
library(tidyverse)
df %>%
rowwise %>%
mutate(Quiz = list(c_across(starts_with("Quiz")) *
if_else(rank(c_across(starts_with("Quiz")),ties.method="last")>=3,1,NA_real_)),
across(matches("\\d$"), ~ NULL), total = sum(Quiz, na.rm = T)) %>%
unnest_wider(Quiz, names_sep = "")
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 NA 42 36 NA 78
#> 2 James 87311 NA NA 36 43 79
#> 3 Charlotte 61755 41 NA NA 39 80
#> 4 Katie 55323 NA NA 32 46 78
#> 5 Olivia 94839 NA NA 43 40 83
#> 6 Timothy 38209 NA 38 38 NA 76
#> 7 Grant 34096 NA 48 44 NA 92
#> 8 Chloe 98432 42 49 NA NA 91
#> 9 Judy 19487 NA NA 42 41 83
#> 10 Justin 94029 NA NA 37 41 78
Upvotes: 2
Reputation: 16856
As @akrun provided above, collapse
is another efficient possibility. radixorder
provides an integer ordering vector, and only the top 2 values in each row are kept, while the others are replaced with NA
. Then, rowSums
is used to get the totals for each row.
library(collapse)
ftransform(gvr(df, "Student|ID"),
dapply(
gvr(df, "^Quiz"),
MARGIN = 1,
FUN = function(x)
replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
)) %>%
ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))
Output
# A tibble: 10 × 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
Upvotes: 1
Reputation: 79208
You do not have to do pivot_wider
. Note that the longer format is the tidy format. Just do pivot_longer
and left_join
:
df %>%
left_join(pivot_longer(., -c(Student, ID)) %>%
group_by(Student, ID) %>%
summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))
# A tibble: 10 x 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 31 42 36 36 78
2 James 87311 25 33 36 43 79
3 Charlotte 61755 41 34 34 39 80
4 Katie 55323 10 22 32 46 78
5 Olivia 94839 35 23 43 40 83
6 Timothy 38209 19 38 38 38 76
7 Grant 34096 27 48 44 43 92
8 Chloe 98432 42 49 42 35 91
9 Judy 19487 15 23 42 41 83
10 Justin 94029 20 30 37 41 78
Upvotes: 3
Reputation: 19088
Try this base R to also get the NA
s
cbind( df[,1:2], t( sapply( seq_along( 1:nrow( df ) ), function(x){
ord <- order( unlist( df[x,3:6] ) )[1:2]; arow <- df[x,3:6];
arow[ord] <- NA; ttl <- rowSums( arow[-ord], na.rm=T );
cbind( arow,Total=ttl ) } ) ) )
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
Upvotes: 1
Reputation: 5788
(A bit messy) Base R Solution:
# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)
# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]
# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))
# Split-Apply-Combine to calculate the top 2 quizes: res => data.frame
res <- data.frame(
do.call(
rbind,
lapply(
with(
df,
split(
df,
Idx
)
),
function(x){
# Extract the top 2 quiz vectors: top_2_quizes => named integer vector
top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
# Calculate the quiz columns not used: remainder_quiz_cols => character vector
remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
# Nullify the remaining quizes: x => data.frame
x[, remainder_quiz_cols] <- NA_integer_
# Calculate the resulting data.frame: data.frame => env
transform(
cbind(
x[,non_quiz_names],
x[,names(top_2_quizes)],
x[,remainder_quiz_cols]
),
Total = sum(top_2_quizes)
)[,c(non_quiz_names, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
}
)
),
row.names = NULL,
stringsAsFactors = FALSE
)
Upvotes: 1
Reputation: 23737
with base R - select just the quiz result columns and you can treat it like a matrix. apply sort in decreasing order, subsetting first two elements, and then use colSums.
df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))
df
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
Upvotes: 4
Reputation: 2321
Based on this StackOverflow answer.
library(tidyverse)
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
df %>%
rowwise() %>%
mutate(Quiz_Total = sum(sort(c(Quiz1,Quiz2,Quiz3,Quiz4), decreasing = TRUE)[1:2])) %>%
ungroup()
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
Upvotes: 5