Reputation: 601
I have 100 simulated data sets, for example a single set is shown below
pid time status
1 2 1
1 6 0
1 4 1
2 3 0
2 1 1
2 7 1
3 8 1
3 11 1
3 2 0
pid denotes patient id. This indicates that each patient has three records on the time and status column. I want to write R code to delete any row with 0 status if that row is not a record for the first observation of a given patient and keep rows with 0 status if it denotes the first observation while the remaining rows with status 1 following the this 0 are deleted for that patient. The output should look like
pid time status
1 2 1
1 4 1
2 3 0
3 8 1
3 11 1
As there are 100 simulated data sets the positions of 0's and 1's in the status column are not the same for all the data. Could anyone be of help to provide R code that can perform this task? Thank you in advance.
Upvotes: 1
Views: 645
Reputation: 3494
This question is more appropriate on https://stackoverflow.com.
Here is an attempt using tapply()
(it's a little verbose):
dat <- structure(list(pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)),
.Names = c("pid", "time", "status"), class = "data.frame",
row.names = c(NA, -9L))
ind <- unlist(tapply(dat$status, dat$pid, function(x) {
# browser()
y <- (rep(FALSE, length(x)))
if (x[1] == 1) {
y[x != 0] <- TRUE
} else {
y[1] <- TRUE
}
y
}))
dat[ind, ]
#> pid time status
#> 1 1 2 1
#> 3 1 4 1
#> 4 2 3 0
#> 7 3 8 1
#> 8 3 11 1
ind
is a vector of TRUE
s and FALSE
s, which will indicate whether a row of dat
should be kept or not according to your rules.
I use tapply(X, INDEX, FUN)
to apply a function to subsets of a vector (here X = dat$status
), which are defined by a grouping factor (here INDEX = dat$pid
).
Here, I used an anonymous function (i.e., FUN = function(x){}
) to do something with each subset of X
.
In particular, I first define y
, which I will return later, to be a vector of FALSE
s.
If the first status is 1 for a subgroup, I turn all elements that are non-zero (i.e., y[x != 0]
) into TRUE
.
Otherwise, I turn only the first element (i.e., y[1]
) into TRUE
.
You may uncomment the browser()
statement and see at the console what the function does by typing n
(for next) or x
or y
(to see what they are).
Upvotes: 1
Reputation: 23608
dplyr
package can help. I added a record to your data example to include multiple 0 values for a pid.
Group by pid and with the function first
you can hold the first value of status. Due to the group by this will be held for all the records per pid. Then just filter if the first record is 0 and row_number() = 1 just in case there are more records with 0 (see pid 4) or if the first record has status = 1 and keep all the records with status 1.
df %>%
group_by(pid) %>%
filter((first(status) == 0 & row_number() == 1) | (first(status) == 1 & status == 1))
# A tibble: 6 x 3
# Groups: pid [4]
pid time status
<int> <int> <int>
1 1 2 1
2 1 4 1
3 2 3 0
4 3 8 1
5 3 11 1
6 4 3 0
data:
df <-
structure(
list(
pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
),
.Names = c("pid", "time", "status"),
class = "data.frame",
row.names = c(NA,-12L)
)
Upvotes: 3