David Maij

Reputation: 53

remove cases following certain other cases

I have a dataframe, say

df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"), 
                y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))

I want to remove only those rows in which one or more t's sit directly between a d and a c; all other rows should be kept. In this example, that means removing the t's in rows 8, 18 and 19 and keeping the others. I have thousands of rows, so doing this manually is not an option. Any help is very much appreciated.

Upvotes: 1

Views: 68

Answers (3)

jogo

Reputation: 12559

Here is another solution with base R:

df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"), 
                y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))

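# collapse the column to one string, so that character i corresponds to row i
s <- paste0(df$x, collapse="")
L <- c(NA, NA)
while (TRUE) {
  r <- regexec("dt+c", s)[[1]]
  if (r[1]==-1) break  # no further match
  # store the position of the first t and the number of t's in this match
  L <- rbind(L, c(pos=r[1]+1, length=attr(r, "match.length")-2))
  # mask the match without changing its length, so the next pass finds the next one
  s <- sub("d(t+)c", "x\\1x", s)
}
L <- L[-1, , drop=FALSE]  # remove the NA placeholder row; keep a matrix even for a single match
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]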
# > drop
# 8 18 19 
# > df[-drop, ]
# x y
# 1  a 2
# 2  a 4
# 3  b 5
# 4  b 2
# 5  b 6
# 6  c 2
# 7  d 4
# 9  c 2
# 10 b 6
# 11 t 2
# 12 c 4
# 13 t 5
# 14 a 2
# 15 a 6
# 16 b 2
# 17 d 4
# 20 c 6

With gregexpr() it is shorter:

s <- paste0(df$x, collapse="")
g <- gregexpr("dt+c", s)[[1]]
L <- data.frame(pos=g+1, length=attr(g, "match.length")-2)
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
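
One caveat with `df[-drop, ]`: if the pattern never matches, `drop` is empty and the negative index returns zero rows instead of the full data frame. The sketch below wraps the same `gregexpr()` idea in a function and filters with a logical mask instead. The name `find_t_rows` is hypothetical (it is not part of the answer above), and the approach assumes every value in `x` is a single character, so that string positions map one-to-one to row numbers.

```r
# Hypothetical helper: row numbers of every run of "t" directly between "d" and "c".
# Assumes each element of x is exactly one character long.
find_t_rows <- function(x) {
  s <- paste0(x, collapse = "")
  g <- gregexpr("dt+c", s)[[1]]
  if (g[1] == -1) return(integer(0))   # pattern not found anywhere
  len <- attr(g, "match.length")
  # within each match, the t's start one character after the "d"
  # and there are (match length - 2) of them
  unlist(Map(function(p, l) seq(p + 1, length.out = l - 2), as.integer(g), len))
}

df <- data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
                 y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))

rows <- find_t_rows(df$x)
keep <- !(seq_len(nrow(df)) %in% rows)  # logical mask: safe when rows is empty
result <- df[keep, ]
```

With the sample data, `rows` holds 8, 18 and 19, and `result` keeps the remaining 17 rows.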

Upvotes: 0

Andrew Gustar

Reputation: 18425

This also works: collapse the column to a string, identify runs of t's between d and c (or between c and d; I am not sure whether you wanted that case as well), then work out their positions and remove the corresponding rows.

df = data.frame(x=c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
                y=c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),stringsAsFactors = FALSE)

dfs <- paste0(df$x,collapse="") #collapse to a string
dfs2 <- do.call(rbind,lapply(list(gregexpr("dt+c",dfs),gregexpr("ct+d",dfs)),
                function(L) data.frame(x=L[[1]],y=attr(L[[1]],"match.length"))))
dfs2 <- dfs2[dfs2$x>0,] #remove any -1 values (if string not found)
drop <- unlist(mapply(function(a,b) (a+1):(a+b-2),dfs2$x,dfs2$y))
df2 <- df[-drop,]
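
As a possible variant, both directions could be handled in a single `gregexpr()` call by using Perl-style lookarounds, so that each match covers only the t's themselves. This is a sketch rather than part of the answer above: the toy vector `x` and the name `t_rows` are made up for illustration, and it again assumes single-character values.

```r
# Toy input (hypothetical): contains d-t-c, c-tt-d and a trailing d-t-a
x <- c("a", "d", "t", "c", "t", "t", "d", "t", "a")
s <- paste0(x, collapse = "")

# Lookbehind/lookahead keep the surrounding d and c out of the match,
# so the match start and length describe the run of t's directly.
g <- gregexpr("(?<=d)t+(?=c)|(?<=c)t+(?=d)", s, perl = TRUE)[[1]]

if (g[1] == -1) {
  t_rows <- integer(0)                       # nothing to remove
} else {
  t_rows <- unlist(Map(function(p, l) seq(p, length.out = l),
                       as.integer(g), attr(g, "match.length")))
}

kept <- x[!(seq_along(x) %in% t_rows)]
```

Because the lookarounds do not consume the neighbouring letters, a c that ends one run can also start the next one, as in d-t-c-tt-d here. The t in row 8 is kept because it is followed by an a.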

Upvotes: 1

Mike H.

Reputation: 14360

One option would be to use rle to get runs of the same string and then you can use an sapply to check forward/backward and return all the positions you want to drop:

rle_vals <- rle(as.character(df$x))

n_runs <- length(rle_vals$values)

drop <- unlist(sapply(seq_len(max(n_runs - 2, 0)) + 1, #loop over interior runs only, so i-1 and i+1 always exist
                      function(i, vals, lengths) {
                        if(vals[i] == "t" && vals[i-1] == "d" && vals[i+1] == "c"){#Check if value is "t", previous is "d" and next is "c"
                          (sum(lengths[seq_len(i-1)]) + 1):sum(lengths[1:i]) #Get row #s
                        }
                      },vals = rle_vals$values, lengths = rle_vals$lengths))

drop
#[1]  8 18 19

df[-drop,]
#   x y
#1  a 2
#2  a 4
#3  b 5
#4  b 2
#5  b 6
#6  c 2
#7  d 4
#9  c 2
#10 b 6
#11 t 2
#12 c 4
#13 t 5
#14 a 2
#15 a 6
#16 b 2
#17 d 4
#20 c 6
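
The same run-length idea can also be sketched without an explicit loop over runs, by computing the start and end row of every run with `cumsum()` and comparing each run with its shifted neighbours. The variable names below (`starts`, `ends`, `prev_val`, `next_val`, `hit`) are illustrative and not taken from the answer above.

```r
x <- c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c")

r <- rle(as.character(x))
n <- length(r$values)

ends   <- cumsum(r$lengths)          # last row of each run
starts <- ends - r$lengths + 1       # first row of each run

prev_val <- c(NA, r$values[-n])      # value of the run before each run
next_val <- c(r$values[-1], NA)      # value of the run after each run

# which() silently drops the NA comparisons at both ends
hit <- which(r$values == "t" & prev_val == "d" & next_val == "c")

rows <- unlist(Map(seq, starts[hit], ends[hit]))
kept <- x[!(seq_along(x) %in% rows)]
```

Unlike the string-based approaches, this does not require the values to be single characters, since runs are compared as whole values.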

Upvotes: 1
