Reputation: 143
I have the following df
dat <- data.frame(Cases = c("Student3","Student3","Student3","Student1","Student1",
"Student2","Student2","Student2","Student4"), Class = rep("Math", 9),
Scores = c(9,5,2,7,3,8,5,1,7), stringsAsFactors = F)
> dat
Cases Class Scores
1 Student3 Math 9
2 Student3 Math 5
3 Student3 Math 2
4 Student1 Math 7
5 Student1 Math 3
6 Student2 Math 8
7 Student2 Math 5
8 Student2 Math 1
9 Student4 Math 7
On the other hand, I have another df with the following information:
d <- data.frame(Cases = c("Student3", "Student1",
"Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)
Cases Class
1 Student3 Math
2 Student1 Math
3 Student2 Math
4 Student4 Math
With these two, I want to extract the highest scores
for each student
. So my output would look like this:
> dat_output
Cases Class Scores
1 Student3 Math 9
2 Student1 Math 7
3 Student2 Math 8
4 Student4 Math 7
I tried with merge
but it is not extracting just the highest scores
.
Upvotes: 5
Views: 159
Reputation: 39858
Using dplyr
:
df %>%
group_by(Cases, Class) %>%
summarise(Scores = max(Scores))
# A tibble: 4 x 3
# Groups: Cases [?]
Cases Class Scores
<chr> <chr> <dbl>
1 Student1 Math 7.
2 Student2 Math 8.
3 Student3 Math 9.
4 Student4 Math 7.
Considering you want to match the two dfs:
df %>%
right_join(df2, by = c("Cases", "Class")) %>%
group_by(Cases, Class) %>%
summarise(Scores = max(Scores))
# A tibble: 4 x 3
# Groups: Cases [?]
Cases Class Scores
<chr> <chr> <dbl>
1 Student1 Math 7.
2 Student2 Math 8.
3 Student3 Math 9.
4 Student4 Math 7.
Upvotes: 1
Reputation: 4970
You can sort your dataframe on Scores
in descending order using order
. Then remove duplicate Cases
. This is a base
R solution.
dat <- dat[order(-dat$Scores),]
dat[duplicated(dat$Cases)==F,]
Cases Class Scores
1 Student3 Math 9
6 Student2 Math 8
4 Student1 Math 7
9 Student4 Math 7
If you first want to make sure that all samples in dat
are also in d
, you can, in a first step, do this. %in%
performs value matching. It, however, does not make a difference based on the example above.
dat <- dat[dat$Cases %in% d$Cases & dat$Class %in% d$Class,]
Upvotes: 3
Reputation: 388807
We can use sapply
on each Cases
in d
, subset the dat
for that Cases
and get the max
score for it.
sapply(d$Cases, function(x) max(dat$Scores[dat$Cases %in% x]))
#Student3 Student1 Student2 Student4
# 9 7 8 7
To get the result as data.frame
transform(d, Scores = sapply(d$Cases, function(x)
max(dat$Scores[dat$Cases %in% x])))
# Cases Class Scores
# Student3 Math 9
# Student1 Math 7
# Student2 Math 8
# Student4 Math 7
Note - I have assumed your d
to be
d <- data.frame(Cases = c("Student3", "Student1",
"Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)
Upvotes: 6
Reputation: 23101
with dplyr
, and considering the case where your d
contains a subset of students from your dat
library(dplyr)
inner_join(d, dat %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores)))
# Cases Class Scores
#1 Student3 Math 9
#2 Student1 Math 7
#3 Student2 Math 8
#4 Student4 Math 7
if order does not matter then the following is more efficient:
inner_join(dat, d) %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores))
# A tibble: 4 x 3
# Groups: Cases [?]
# Cases Class Scores
# <chr> <chr> <dbl>
#1 Student1 Math 7
#2 Student2 Math 8
#3 Student3 Math 9
#4 Student4 Math 7
Upvotes: 3
Reputation: 56004
Using dplyr, group by students, and get first value based on scores:
library(dplyr)
dat %>%
filter(Cases %in% d$Cases) %>%
group_by(Cases) %>%
top_n(1, Scores) %>%
ungroup()
# # A tibble: 4 x 3
# Cases Class Scores
# <chr> <chr> <dbl>
# 1 Student1 Math 7
# 2 Student2 Math 8
# 3 Student3 Math 9
# 4 Student4 Math 7
Upvotes: 1
Reputation: 3211
You can also use sqldf
package as follow:
sqldf("select max(Scores), Cases from dat JOIN d USING(Cases) group by Cases")
Apply JOIN
operation, group by cases
and select max(Scores),Cases
to get the desired output:
max(Scores) Cases
1 7 Student1
2 8 Student2
3 9 Student3
4 7 Student4
Upvotes: 3
Reputation: 6132
If I am correct you don't need d
, since in d
there is no additional information that is not in dat
already.
You can just do:
dat_output <- aggregate(Scores ~ Cases, dat, max)
dat_output
Cases Scores
1 Student1 7
2 Student2 8
3 Student3 9
4 Student4 7
Upvotes: 3