R: How to delete rows of a data frame based on the values of a given column

Question

I have 100 simulated data sets; a single one is shown below as an example:

pid time status
1    2     1
1    6     0
1    4     1
2    3     0
2    1     1
2    7     1
3    8     1
3    11    1
3    2     0

pid denotes the patient id, so each patient has three records in the time and status columns. I want to write R code that does the following for each patient: if the patient's first observation has status 0, keep that row and delete all of that patient's remaining rows, including those with status 1 that follow the 0; otherwise, delete every row with status 0. The output should look like this:

pid time status
1    2     1
1    4     1
2    3     0
3    8     1
3    11    1

As there are 100 simulated data sets, the positions of the 0s and 1s in the status column differ between data sets. Could anyone provide R code that performs this task? Thank you in advance.

phiver · Accepted Answer

The dplyr package can help. I added a fourth patient (pid 4) to your example data to cover a pid with multiple 0 values.

Group by pid and use the function first() to get the first value of status within each group. Because of the grouping, this value is available for every record of that pid. Then filter on two conditions: if the first record has status 0, keep only the row where row_number() == 1, in case there are more records with status 0 (see pid 4); if the first record has status 1, keep all the records with status 1.

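library(dplyr)

df %>%
  group_by(pid) %>%
  # first status is 0: keep only the first row; first status is 1: keep all rows with status 1
  filter((first(status) == 0 & row_number() == 1) | (first(status) == 1 & status == 1))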

# A tibble: 6 x 3
# Groups:   pid [4]
    pid  time status
  <int> <int>  <int>
1     1     2      1
2     1     4      1
3     2     3      0
4     3     8      1
5     3    11      1
6     4     3      0
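
For readers who prefer not to load dplyr, the same two conditions can be sketched in base R. This is only an illustration, not part of the original answer: it uses ave() to compute the per-pid first status and the row position within each pid, and the names first_status, row_in_pid, keep and result are made up for this example.

```r
# Example data, same values as in the answer
df <- data.frame(
  pid    = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  time   = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
  status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
)

# First status within each pid, repeated for every row of that pid
first_status <- ave(df$status, df$pid, FUN = function(s) s[1])

# Position of each row within its pid (1, 2, 3, ...)
row_in_pid <- ave(seq_along(df$pid), df$pid, FUN = seq_along)

# Same logic as the dplyr filter: keep the leading 0 only,
# or keep every status-1 row when the pid starts with a 1
keep <- (first_status == 0 & row_in_pid == 1) |
        (first_status == 1 & df$status == 1)

result <- df[keep, ]
print(result)
```

Like the dplyr version, this treats the "first observation" as the first row of each pid in the order the rows appear, not the row with the smallest time.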

data:

df <-
  structure(
    list(
      pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
      time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
      status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
    ),
    .Names = c("pid", "time", "status"),
    class = "data.frame",
    row.names = c(NA,-12L)
  )
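
Since the question mentions 100 simulated data sets, one possible way to reuse the filter is to wrap it in a function and apply it to a list of data frames. The sketch below assumes the data sets are stored in a list; the helper name keep_rows and the two small data sets in sims are hypothetical stand-ins, not taken from the question.

```r
library(dplyr)

# Hypothetical helper: applies the answer's filter to one data set
keep_rows <- function(d) {
  d %>%
    group_by(pid) %>%
    filter((first(status) == 0 & row_number() == 1) |
           (first(status) == 1 & status == 1)) %>%
    ungroup()
}

# Two small made-up data sets standing in for the 100 simulated ones
sims <- list(
  data.frame(pid    = c(1L, 1L, 1L, 2L, 2L, 2L),
             time   = c(2L, 6L, 4L, 3L, 1L, 7L),
             status = c(1L, 0L, 1L, 0L, 1L, 1L)),
  data.frame(pid    = c(1L, 1L, 1L),
             time   = c(8L, 11L, 2L),
             status = c(1L, 1L, 0L))
)

# Apply the same filter to every data set; the result is a list of tibbles
cleaned <- lapply(sims, keep_rows)
```

The ungroup() call is a design choice so that the returned tibbles no longer carry the pid grouping into later steps.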
