Cahidora
Cahidora

Reputation: 143

Select highest values in a dataframe by group

I have the following df

dat <- data.frame(Cases = c("Student3","Student3","Student3","Student1","Student1",
"Student2","Student2","Student2","Student4"), Class = rep("Math", 9),
Scores = c(9,5,2,7,3,8,5,1,7), stringsAsFactors = F)


> dat
   Cases    Class   Scores
1 Student3  Math      9
2 Student3  Math      5
3 Student3  Math      2
4 Student1  Math      7
5 Student1  Math      3
6 Student2  Math      8
7 Student2  Math      5
8 Student2  Math      1
9 Student4  Math      7

On the other hand, I have another df with the following information:

d <- data.frame(Cases = c("Student3", "Student1",
"Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)

    Cases  Class
1 Student3  Math
2 Student1  Math
3 Student2  Math
4 Student4  Math

With these two, I want to extract the highest scores for each student. So my output would look like this:

> dat_output
    Cases  Class   Scores
1 Student3  Math      9
2 Student1  Math      7
3 Student2  Math      8
4 Student4  Math      7

I tried with merge but it is not extracting just the highest scores.

Upvotes: 5

Views: 159

Answers (7)

tmfmnk
tmfmnk

Reputation: 39858

Using dplyr:

df %>% 
  group_by(Cases, Class) %>% 
  summarise(Scores = max(Scores))

# A tibble: 4 x 3
# Groups:   Cases [?]
  Cases    Class Scores
  <chr>    <chr>  <dbl>
1 Student1 Math      7.
2 Student2 Math      8.
3 Student3 Math      9.
4 Student4 Math      7.

Considering you want to match the two dfs:

df %>%  
  right_join(df2, by = c("Cases", "Class")) %>% 
  group_by(Cases, Class) %>% 
  summarise(Scores = max(Scores))

# A tibble: 4 x 3
# Groups:   Cases [?]
  Cases    Class Scores
  <chr>    <chr>  <dbl>
1 Student1 Math      7.
2 Student2 Math      8.
3 Student3 Math      9.
4 Student4 Math      7.

Upvotes: 1

milan
milan

Reputation: 4970

You can sort your dataframe on Scores in descending order using order. Then remove duplicate Cases. This is a base R solution.

dat <- dat[order(-dat$Scores),]
dat[duplicated(dat$Cases)==F,]

     Cases Class Scores
1 Student3  Math      9
6 Student2  Math      8
4 Student1  Math      7
9 Student4  Math      7

If you first want to make sure that all samples in dat are also in d, you can, in a first step, do this. %in% performs value matching. It, however, does not make a difference based on the example above.

dat <- dat[dat$Cases %in% d$Cases & dat$Class %in% d$Class,]

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388807

We can use sapply on each Cases in d, subset the dat for that Cases and get the max score for it.

sapply(d$Cases, function(x) max(dat$Scores[dat$Cases %in% x]))

#Student3 Student1 Student2 Student4 
#       9        7        8        7 

To get the result as data.frame

transform(d, Scores = sapply(d$Cases, function(x) 
                     max(dat$Scores[dat$Cases %in% x])))

#    Cases Class Scores
# Student3  Math      9 
# Student1  Math      7
# Student2  Math      8
# Student4  Math      7

Note - I have assumed your d to be

d <- data.frame(Cases = c("Student3", "Student1",
      "Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)

Upvotes: 6

Sandipan Dey
Sandipan Dey

Reputation: 23101

with dplyr, and considering the case where your d contains a subset of students from your dat

library(dplyr)
inner_join(d, dat %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores)))

# Cases Class Scores
#1 Student3  Math      9
#2 Student1  Math      7
#3 Student2  Math      8
#4 Student4  Math      7

if order does not matter then the following is more efficient:

inner_join(dat, d) %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores))
# A tibble: 4 x 3
# Groups:   Cases [?]
#  Cases    Class Scores
#  <chr>    <chr>  <dbl>
#1 Student1 Math       7
#2 Student2 Math       8
#3 Student3 Math       9
#4 Student4 Math       7

Upvotes: 3

zx8754
zx8754

Reputation: 56004

Using dplyr, group by students, and get first value based on scores:

library(dplyr)

dat %>% 
  filter(Cases %in% d$Cases) %>% 
  group_by(Cases) %>% 
  top_n(1, Scores) %>%
  ungroup()

# # A tibble: 4 x 3
#   Cases    Class Scores
#   <chr>    <chr>  <dbl>
# 1 Student1 Math       7
# 2 Student2 Math       8
# 3 Student3 Math       9
# 4 Student4 Math       7

Upvotes: 1

Saurabh Chauhan
Saurabh Chauhan

Reputation: 3211

You can also use sqldf package as follow:

sqldf("select max(Scores), Cases from dat JOIN d USING(Cases) group by Cases")

Apply JOIN operation, group by cases and select max(Scores),Cases to get the desired output:

   max(Scores)    Cases
1           7    Student1
2           8    Student2
3           9    Student3
4           7    Student4

Upvotes: 3

Lennyy
Lennyy

Reputation: 6132

If I am correct you don't need d, since in d there is no additional information that is not in dat already.

You can just do:

dat_output <- aggregate(Scores ~ Cases, dat, max)
dat_output

     Cases Scores
1 Student1      7
2 Student2      8
3 Student3      9
4 Student4      7

Upvotes: 3

Related Questions