Reputation: 781
I have data sets
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4)
x <- c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6)
y <- c(2,2,3,6,6,4,5, 1,1,5,5,5,2,2,2)
df <- data.frame(ID, x, y)
df
ID x y
1 1 1 2
2 1 2 2
3 1 3 3
4 2 1 6
5 2 2 6
6 2 3 4
7 2 4 5
8 3 1 1
9 3 2 1
10 4 1 5
11 4 2 5
12 4 3 5
13 4 4 2
14 4 5 2
15 4 6 2
If you see ID 1 have 3 rows, by y of the third row change y = 3, so I want to set y = 2 (The same number of previous row), the ID 2 have y change at y = 4, I want to set y = 6 and delete next row. When the number of y change for each ID, we set only the first row change as the same at previous row, the rest remove it.
The table will be
ID x y
1 1 2
1 2 2
1 3 2
2 1 6
2 2 6
2 3 6
3 1 1
3 2 1
4 1 5
4 2 5
4 3 5
4 4 5
I couldn't figure out, do you have any idea, please help me, thanks.
Upvotes: 1
Views: 339
Reputation: 887731
We can use data.table
. Convert the 'data.frame' to 'data.table' (setDT(df)
), grouped by 'ID', if
there is only a unique
element in 'y' get the sequence of rows (1:.N
) or else
get the difference of 'y' (diff
), check whether it is not equal to 0, use which
to return the numeric index of the first TRUE ([1]
),get the sequence and wrap it with .I
to return row index.
library(data.table)
i1 <- setDT(df)[, if(uniqueN(y) >1) .I[seq(which(c(FALSE,diff(y)!=0))[1])]
else .I[1:.N], ID]$V1
Based on 'i1', we subset the rows of 'df', grouped by 'ID', we assign (:=
), the 1st element in 'y' to change the 'y' column.
df[i1][, y:= y[1], ID][]
# ID x y
#1: 1 1 2
#2: 1 2 2
#3: 1 3 2
#4: 2 1 6
#5: 2 2 6
#6: 2 3 6
#7: 3 1 1
#8: 3 2 1
#9: 4 1 5
#10: 4 2 5
#11: 4 3 5
#12: 4 4 5
Or we can use a bit more simple coding with dplyr
. (Disclaimer: The idea is somewhat similar to @Psidom's code). After grouping by 'ID', we get the lag
of 'y', get a logical index by comparing with the first observation, filter
the rows based on that and change the 'y' values to the first value.
library(dplyr)
df %>%
group_by(ID) %>%
filter(first(y)==lag(y, default = first(y))) %>%
mutate(y, y=first(y))
# ID x y
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 1 2 2
#3 1 3 2
#4 2 1 6
#5 2 2 6
#6 2 3 6
#7 3 1 1
#8 3 2 1
#9 4 1 5
#10 4 2 5
#11 4 3 5
#12 4 4 5
Or another option is ave
from base R
df1 <- df[with(df, as.logical(ave(y, ID, FUN = function(x)
lag(x, default= x[1])== x[1]))),]
df1$y <- with(df1, ave(y, ID, FUN= function(x) x[1]))
Upvotes: 2
Reputation: 215117
Or we can do
library(data.table)
df1 <- setDT(df)[, .SD[shift(rleid(y), fill = 1) == 1], .(ID)]
df1[, y := y[1], .(ID)]
df1
ID x y
1: 1 1 2
2: 1 2 2
3: 1 3 2
4: 2 1 6
5: 2 2 6
6: 2 3 6
7: 3 1 1
8: 3 2 1
9: 4 1 5
10: 4 2 5
11: 4 3 5
12: 4 4 5
Upvotes: 4
Reputation: 5600
You could use a for loop, matching to the first instance of a given ID:
for( i in 1:nrow(df) ){
df$new[i] <- df$y[ match( df$ID[i], df$ID ) ]
}
This works because you're effectively asking for all subsequent values of y to be replaced with the first value, for a given ID. match
returns the first value matching a given criteria, which works well for what you're after.
Or you could eliminate the for loop by first extracting ID as a variable:
ID <- df$ID
df$new <- df$y[ match( ID, df$ID ) ]
EDIT TO ADD: Sorry, here's a step to add to delete rows as requested
df <- subset( df, y == new |
( shift( y, 1, type = "lag" ) != y &
shift( ID, 1, type = "lag" ) == ID )
)
Upvotes: 0