Reputation: 303
I have a very large data set that looks like the sample below:
df <- data.frame(school=c("a", "a", "a", "b","b","c","c","c"), year=c(3,3,1,4,2,4,3,1), GPA=c(4,4,4,3,3,3,2,2))
school year GPA
a 3 4
a 3 4
a 1 4
b 4 3
b 2 3
c 4 3
c 3 2
c 1 2
and I want it to look like:
school year GPA
a 3 4
a 3 4
b 4 3
c 4 3
So basically, for each school I want the student (or students) in the highest year, regardless of GPA.
I have tried:
new_df <- df[!duplicated(paste(df[,1],df[,2])),]
but this gives me the unique combinations of school and year,
while the one below gives me only the first row for each school:
new_df2 <- df[!duplicated(df$school),]
Upvotes: 4
Views: 199
Reputation: 93813
I'm a fan of the by function (see ?by) for this kind of thing. df is split into groups on the basis of df$school, and then the rows of each school whose year equals max(year) are returned.
> by(df,df$school,function(x) x[x$year==max(x$year),])
df$school: a
school year GPA
1 a 3 4
2 a 3 4
------------------------------------------------------------
df$school: b
school year GPA
4 b 4 3
------------------------------------------------------------
df$school: c
school year GPA
6 c 4 3
do.call(rbind, ...) just joins up the results for each school that are returned from the by call.
do.call(rbind,by(df,df$school,function(x) x[x$year==max(x$year),]))
school year GPA
a.1 a 3 4
a.2 a 3 4
b b 4 3
c c 4 3
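Since the question mentions a very large data set, here is a sketch of a vectorized base R alternative that avoids splitting and re-binding. It is not part of the answer above; it assumes the same df as in the question, and the names top_year and res are made up for illustration.

```r
# Same sample data as in the question
df <- data.frame(school = c("a", "a", "a", "b", "b", "c", "c", "c"),
                 year   = c(3, 3, 1, 4, 2, 4, 3, 1),
                 GPA    = c(4, 4, 4, 3, 3, 3, 2, 2))

# ave() returns, for every row, the maximum year within that row's school
# (top_year is a hypothetical helper name)
top_year <- ave(df$year, df$school, FUN = max)

# Keep the rows whose year equals their school's maximum; ties are kept
res <- df[df$year == top_year, ]
res
```

With this approach the original row names (1, 2, 4, 6) are preserved rather than being prefixed with the school name.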
Upvotes: 5
Reputation: 7475
Using the plyr library:
require(plyr)
ddply(df,.(school),function(x){x[x$year==max(x$year),]})
> ddply(df,.(school),function(x){x[x$year==max(x$year),]})
school year GPA
1 a 3 4
2 a 3 4
3 b 4 3
4 c 4 3
or in base R:
test<-lapply(split(df,df$school),function(x){x[x$year==max(x$year),]})
out<-do.call(rbind,test)
> out
school year GPA
a.1 a 3 4
a.2 a 3 4
b b 4 3
c c 4 3
Explanation:
split splits the data frame into a list of data frames, one per school.
dat<-split(df,df$school)
> dat
$a
school year GPA
1 a 3 4
2 a 3 4
3 a 1 4
$b
school year GPA
4 b 4 3
5 b 2 3
$c
school year GPA
6 c 4 3
7 c 3 2
8 c 1 2
For each school, we want the members in the top year.
dum.fun<-function(x){x[x$year==max(x$year),]}
> dum.fun(dat$a)
school year GPA
1 a 3 4
2 a 3 4
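One detail of this function that may be worth spelling out: comparing against max() keeps every row tied for the top year, which is what the question asks for ("student(students)"). The sketch below contrasts it with which.max(), which would keep only one row. The variable names a, all_top and first_top are made up for illustration.

```r
# School "a" from the question: two students share the top year
a <- data.frame(school = c("a", "a", "a"),
                year   = c(3, 3, 1),
                GPA    = c(4, 4, 4))

# Comparing against max() keeps every row tied for the top year
all_top <- a[a$year == max(a$year), ]

# which.max() returns only the index of the first maximum
first_top <- a[which.max(a$year), ]

nrow(all_top)    # 2
nrow(first_top)  # 1
```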
lapply applies a function over the members of a list and returns a list:
> lapply(split(df,df$school),function(x){x[x$year==max(x$year),]})
$a
school year GPA
1 a 3 4
2 a 3 4
$b
school year GPA
4 b 4 3
$c
school year GPA
6 c 4 3
This is what we want, but in list form. We need to bind the members of the list together. We do this with do.call, which calls rbind once with all the members of the list as its arguments.
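To make the do.call step concrete, here is a small sketch that writes the same rbind call out by hand, using the list names as argument names. It assumes the df from the question; out_manual is a made-up name, and the final row-name reset is an optional extra that the answer above does not include.

```r
# Same sample data as in the question
df <- data.frame(school = c("a", "a", "a", "b", "b", "c", "c", "c"),
                 year   = c(3, 3, 1, 4, 2, 4, 3, 1),
                 GPA    = c(4, 4, 4, 3, 3, 3, 2, 2))

test <- lapply(split(df, df$school), function(x) x[x$year == max(x$year), ])

# do.call() passes the list elements as the arguments of a single rbind() call
out <- do.call(rbind, test)

# The same call spelled out by hand (out_manual is a hypothetical name)
out_manual <- rbind(a = test$a, b = test$b, c = test$c)

# Check that both routes give the same data frame
all.equal(out, out_manual)

# Optional: drop the a.1, a.2, ... row names to match the desired output
rownames(out) <- NULL
out
```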
Upvotes: 6