Reputation: 125

Delete rows with redundant information in R (not just duplicates)

In this sample data:

id<-c(2,2,2,2,2,3,3,3,3,3,3,4,4,4,4)
time<-c(3,5,7,8,9,2,8,10,12,14,18,4,6,7,9)
status<-c('mar','mar','div','c','mar','mar','div','mar','mar','c','div','mar','mar','c','mar')

myd<-data.frame(id,time,status)

   id time status
1   2    3    mar
2   2    5    mar
3   2    7    div
4   2    8      c
5   2    9    mar
6   3    2    mar
7   3    8    div
8   3   10    mar
9   3   12    mar
10  3   14      c
11  3   18    div
12  4    4    mar
13  4    6    mar
14  4    7      c
15  4    9    mar

I need to know when each person married. If there are two consecutive 'mar' rows with no 'div' anywhere in between, the person never divorced, so it is the same marriage and the timing of the repeated row is not needed. The same goes for a mar, c, mar sequence: since no div occurs, the marriage before and after the child is the same one, so the second mar can be deleted. I suspect I need min(time[status=='mar']), but that would be incorrect for a sequence such as mar, mar, div, mar, div, mar, where only the 2nd mar needs deleting, not every mar after the first one.

So the new data should look something like

   id time status
2   2    5    mar
3   2    7    div
4   2    8      c
5   2    9    mar
6   3    2    mar
7   3    8    div
8   3   10    mar
10  3   14      c
11  3   18    div
13  4    6    mar
14  4    7      c

This was my approach, which only worked for one row:

library(dplyr)

myd2<-myd %>% 
  group_by(id) %>% 
  mutate(dum1=ifelse(status=='mar',min(time[status=='mar']),NA),
         dum2=cumsum(status=='div'),
         flag=ifelse(time>dum1 & dum2==0,1,0))

If I get rid of dum2==0, it deletes too many rows.

Upvotes: 1

Answers (4)

r2evans

Reputation: 160417

Using a quick helper function,

func <- function(x, vals = c("mar", "div")) {
  out <- rep(TRUE, length(x))
  last <- x[1]
  for (ind in seq_along(x)[-1]) {
    out[ind] <- x[ind] != last || !x[ind] %in% vals
    if (out[ind] && x[ind] %in% vals) last <- x[ind]
  }
  out
}

We can do

library(data.table)
as.data.table(myd)[, .SD[func(status),], by = .(id)]
#        id  time status
#     <num> <num> <char>
#  1:     2     3    mar
#  2:     2     7    div
#  3:     2     8      c
#  4:     2     9    mar
#  5:     3     2    mar
#  6:     3     8    div
#  7:     3    10    mar
#  8:     3    14      c
#  9:     3    18    div
# 10:     4     4    mar
# 11:     4     7      c

If you want this in dplyr, then

library(dplyr)
myd %>%
  group_by(id) %>%
  filter(func(status))
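
For readers who would rather avoid the explicit loop, the same rule can be sketched in base R. This is an illustrative alternative rather than part of the original answer: the name `keep_changes` is made up here, and the sketch assumes the rows are already sorted by time within each id.

```r
# Hypothetical vectorized counterpart to func(); the name is illustrative.
keep_changes <- function(x, vals = c("mar", "div")) {
  x <- as.character(x)             # also copes with factor columns
  keep <- rep(TRUE, length(x))     # statuses outside vals are always kept
  is_event <- x %in% vals
  ev <- x[is_event]
  # An event is kept when it is the first one or differs from the previous event.
  keep[is_event] <- c(TRUE, ev[-1] != ev[-length(ev)])[seq_along(ev)]
  keep
}

# The sequence from the question: only the second "mar" is dropped.
keep_changes(c("mar", "mar", "div", "mar", "div", "mar"))
# [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

# Per-id application without extra packages, on the question's sample data.
id <- c(2,2,2,2,2,3,3,3,3,3,3,4,4,4,4)
time <- c(3,5,7,8,9,2,8,10,12,14,18,4,6,7,9)
status <- c('mar','mar','div','c','mar','mar','div','mar','mar','c','div','mar','mar','c','mar')
myd <- data.frame(id, time, status)

# ave() hands each id's row indices to the helper; the result comes back as 0/1.
keep <- ave(seq_len(nrow(myd)), myd$id,
            FUN = function(i) keep_changes(myd$status[i])) == 1
res <- myd[keep, ]
```

On the sample data, `res` should contain the same 11 rows as the data.table result above.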

Upvotes: 2

Teddly

Reputation: 110

I read two different questions in your post.

1. When the person first married
2. How to make a list that removes redundant status information

It seems like you have a solution for #1, but you actually want #2.

I read #2 as a desire to filter out rows where the id and status are the same as the previous row. That would look like:

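# keep the first row (where lag() is NA) and any row that differs from the previous one
myd %>%
  filter(is.na(lag(id)) | !(id == lag(id) & status == lag(status)))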

Upvotes: 1

ktiu

Reputation: 2626

My approach:

library(dplyr)

myd %>%
  group_by(id) %>% 
  arrange(time) %>%
  filter(status != lag(status) | is.na(lag(status))) %>%
  ungroup() %>%
  arrange(id)

Returns:

# A tibble: 12 x 3
      id  time status
   <dbl> <dbl> <chr>
 1     2     3 mar
 2     2     7 div
 3     2     8 c
 4     2     9 mar
 5     3     2 mar
 6     3     8 div
 7     3    10 mar
 8     3    14 c
 9     3    18 div
10     4     4 mar
11     4     7 c
12     4     9 mar
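
Note that comparing each row only with the row directly before it treats mar, c, mar as two changes, so the second mar survives (row 12 above), whereas the desired output in the question drops it. A small base R sketch of that behaviour, where the hypothetical helper `differs_from_prev` stands in for the `lag()` comparison:

```r
# Hypothetical stand-in for `status != lag(status) | is.na(lag(status))`.
differs_from_prev <- function(x) {
  x <- as.character(x)
  # The first element has no predecessor, so it is always kept.
  c(TRUE, x[-1] != x[-length(x)])[seq_along(x)]
}

differs_from_prev(c("mar", "mar", "c", "mar"))
# [1]  TRUE FALSE  TRUE  TRUE
```

If that trailing mar should also be removed, the comparison would need to skip statuses other than mar and div, which is what the helper-function answer does.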

Upvotes: 2

Andy Eggers

Reputation: 612

I would delete rows in which the status is unchanged by creating a lag_status variable in grouped data:

> myd %>% 
+     arrange(id, time) %>% 
+     group_by(id) %>% 
+     mutate(lag_status = lag(status)) %>%
+     ungroup() %>% 
+     filter(is.na(lag_status) | status != lag_status) %>% 
+     select(-lag_status)
# A tibble: 12 x 3
      id  time status
   <dbl> <dbl> <fct> 
 1     2     3 mar   
 2     2     7 div   
 3     2     8 c     
 4     2     9 mar   
 5     3     2 mar   
 6     3     8 div   
 7     3    10 mar   
 8     3    14 c     
 9     3    18 div   
10     4     4 mar   
11     4     7 c     
12     4     9 mar

Upvotes: 1
