BIN
BIN

Reputation: 781

Remove rows which are different with the first changing in R

I have data sets

 ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4)
 x <- c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6)
 y <- c(2,2,3,6,6,4,5, 1,1,5,5,5,2,2,2)
 df <- data.frame(ID, x, y)
 df
    ID x y
1   1 1 2
2   1 2 2
3   1 3 3
4   2 1 6
5   2 2 6
6   2 3 4
7   2 4 5
8   3 1 1
9   3 2 1
10  4 1 5
11  4 2 5
12  4 3 5
13  4 4 2
14  4 5 2
15  4 6 2

If you see ID 1 have 3 rows, by y of the third row change y = 3, so I want to set y = 2 (The same number of previous row), the ID 2 have y change at y = 4, I want to set y = 6 and delete next row. When the number of y change for each ID, we set only the first row change as the same at previous row, the rest remove it.

The table will be

    ID x y
     1 1 2
     1 2 2
     1 3 2
     2 1 6
     2 2 6
     2 3 6
     3 1 1
     3 2 1
     4 1 5
     4 2 5
     4 3 5
     4 4 5

I couldn't figure out, do you have any idea, please help me, thanks.

Upvotes: 1

Views: 339

Answers (3)

akrun
akrun

Reputation: 887731

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'ID', if there is only a unique element in 'y' get the sequence of rows (1:.N) or else get the difference of 'y' (diff), check whether it is not equal to 0, use which to return the numeric index of the first TRUE ([1]),get the sequence and wrap it with .I to return row index.

library(data.table)
i1 <- setDT(df)[, if(uniqueN(y) >1)  .I[seq(which(c(FALSE,diff(y)!=0))[1])] 
                     else .I[1:.N], ID]$V1

Based on 'i1', we subset the rows of 'df', grouped by 'ID', we assign (:=), the 1st element in 'y' to change the 'y' column.

df[i1][, y:= y[1],  ID][]
#    ID x y
#1:  1 1 2
#2:  1 2 2
#3:  1 3 2
#4:  2 1 6
#5:  2 2 6
#6:  2 3 6
#7:  3 1 1
#8:  3 2 1
#9:  4 1 5
#10: 4 2 5
#11: 4 3 5
#12: 4 4 5

Or we can use a bit more simple coding with dplyr. (Disclaimer: The idea is somewhat similar to @Psidom's code). After grouping by 'ID', we get the lag of 'y', get a logical index by comparing with the first observation, filter the rows based on that and change the 'y' values to the first value.

 library(dplyr)
 df %>%
      group_by(ID) %>%
      filter(first(y)==lag(y, default = first(y))) %>% 
      mutate(y, y=first(y))
 #        ID     x     y
 #     <dbl> <dbl> <dbl>
 #1      1     1     2
 #2      1     2     2
 #3      1     3     2
 #4      2     1     6
 #5      2     2     6
 #6      2     3     6
 #7      3     1     1
 #8      3     2     1
 #9      4     1     5
 #10     4     2     5
 #11     4     3     5
 #12     4     4     5

Or another option is ave from base R

 df1 <- df[with(df, as.logical(ave(y, ID, FUN = function(x)
                               lag(x, default= x[1])== x[1]))),]
 df1$y <- with(df1, ave(y, ID, FUN= function(x) x[1]))

Upvotes: 2

akuiper
akuiper

Reputation: 215117

Or we can do

library(data.table)
df1 <- setDT(df)[, .SD[shift(rleid(y), fill = 1) == 1], .(ID)]
df1[, y := y[1], .(ID)]
df1
    ID x y
 1:  1 1 2
 2:  1 2 2
 3:  1 3 2
 4:  2 1 6
 5:  2 2 6
 6:  2 3 6
 7:  3 1 1
 8:  3 2 1
 9:  4 1 5
10:  4 2 5
11:  4 3 5
12:  4 4 5

Upvotes: 4

rosscova
rosscova

Reputation: 5600

You could use a for loop, matching to the first instance of a given ID:

for( i in 1:nrow(df) ){ 
    df$new[i] <- df$y[ match( df$ID[i], df$ID ) ] 
}

This works because you're effectively asking for all subsequent values of y to be replaced with the first value, for a given ID. match returns the first value matching a given criteria, which works well for what you're after.

Or you could eliminate the for loop by first extracting ID as a variable:

ID <- df$ID
df$new <- df$y[ match( ID, df$ID ) ]

EDIT TO ADD: Sorry, here's a step to add to delete rows as requested

df <- subset( df, y == new | 
        ( shift( y, 1, type = "lag" ) != y & 
            shift( ID, 1, type = "lag" ) == ID ) 
        )

Upvotes: 0

Related Questions