user2491619

Reputation: 11

R - using apply to compare previous rows

So I have a data frame from which I'd like to pull out rows that have the same name in column 2. For each set of duplicated rows with the same name, I'd then like to keep only the row with the highest value, and only if its score is at least 2 greater than the other duplicates. So in this example, I want to keep row 2 but not row 5.

>df <- data.frame(score=c(1,5,1,3,3),name=c("A1","A1","A2","A3","A3"))
>df
score name
 1    A1
 5    A1
 1    A2
 3    A3
 3    A3

I can almost get what I want with a for loop that fills a little matrix with "dup" or "keep", which I then use to pull out the rows of the data frame that satisfy both conditions.

>test <- matrix(ncol=1,nrow=nrow(df))
>for(i in 2:nrow(df)){test[i] <- if((df[i,"name"] == df[i-1,"name"]) && (df[i,"score"] >= (df[i-1,"score"] + 2))) "keep" else "dup"}
> test
     [,1]  
[1,] NA    
[2,] "keep"
[3,] "dup" 
[4,] "dup" 
[5,] "dup"
>df[which(test[,1] == "keep"),]
    score name
2     5   A1

Which works (apart from the first one), but is obviously ugly and slow as hell. I know there must be a way to do this with some version of apply, but I couldn't work out how to specify the previous row in the function. The actual dataframe is huge, so any tidier way would be great.

Eventually I want the function to also keep rows that have a unique name too, so if this could be incorporated into the same function, I'd be very happy!

Thanks in advance for any help....

Upvotes: 1

Views: 1929

Answers (2)

MSS

Reputation: 53

Try this:

   
> aggregate(score~name, data=df, max)
   name score
1   A1     5
2   A2     1
3   A3     3
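
Note that aggregate() returns the maximum score per name but does not check how far that maximum is from the other rows. The row-versus-previous-row test from the question can be written without a loop by shifting the columns down one position. This is only a minimal sketch: prev_name, prev_score and keep are names made up for illustration, and, like the loop in the question, it only compares adjacent rows.

```r
df <- data.frame(score = c(1, 5, 1, 3, 3),
                 name  = c("A1", "A1", "A2", "A3", "A3"))

# Shift each column down by one row so that element i holds the value
# from row i - 1; the first row has no predecessor, hence the leading NA.
prev_name  <- c(NA, head(as.character(df$name), -1))
prev_score <- c(NA, head(df$score, -1))

# Keep a row when it has the same name as the previous row and its score
# is at least 2 greater than the previous row's score.
keep <- !is.na(prev_name) &
  df$name == prev_name &
  df$score >= prev_score + 2

df[keep, ]
```

On the example data this selects row 2 only, the same row the loop marks as "keep".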

Upvotes: 0

agstudy

Reputation: 121568

What about this?

x <- df[order(df$name),]
x$diff <- ave(x$score, x$name, FUN=function(x) c(NA,diff(x)))
x[duplicated(x$name) & x$diff > 2,]
 score name diff
2     5   A1    4

EDIT

The previous solution is wrong; here is the correct one (I hope). I group the elements by name and keep only the rows that satisfy a certain condition (similar to outlier detection).

df <- data.frame(score=c(1,5,1,3,3,6,6),name=c("A1","A1","A2","A3","A3","A2","A1"))
by(df$score, df$name, FUN=function(x)
  if(max(x) > 2*max(x[-which.max(x)]))
     max(x)
  else NA)

df$name: A1
[1] NA
------------------------------------------------------------------------------------------------ 
df$name: A2
[1] 6
------------------------------------------------------------------------------------------------ 
df$name: A3
[1] NA
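
For the rule as stated in the question (keep the top row of a name only if its score is at least 2 greater than the next-highest score, and also keep names that occur once), a possible sketch follows. The function keep_clear_winners() and its margin argument are hypothetical names introduced here, not part of any package, and the ">= 2" threshold is my reading of "2 greater".

```r
# Hypothetical helper: keep the top-scoring row of each name when it beats
# the runner-up by at least `margin`, plus every name that occurs only once.
keep_clear_winners <- function(df, margin = 2) {
  # Sort by name, then by descending score, so the first row of each
  # group holds that group's maximum.
  x <- df[order(df$name, -df$score), , drop = FALSE]
  # Number of rows in each row's group.
  n <- ave(x$score, x$name, FUN = length)
  # Gap between each score and the next one in its group
  # (NA for the last row of a group).
  gap <- ave(x$score, x$name, FUN = function(s) c(-diff(s), NA))
  first <- !duplicated(x$name)
  keep <- first & (n == 1 | (!is.na(gap) & gap >= margin))
  x[keep, , drop = FALSE]
}

df <- data.frame(score = c(1, 5, 1, 3, 3),
                 name  = c("A1", "A1", "A2", "A3", "A3"))
keep_clear_winners(df)
#   score name
# 2     5   A1
# 3     1   A2
```

Sorting once and using ave() avoids any per-row loop, which should matter on a large data frame. If unique names should not be kept, drop the n == 1 part of the condition.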

Upvotes: 1
