Reputation: 2989

Delete rows after a certain sequence of values in a certain column

a <- c("A","A","A","B","B","B","C","C","C","C","D","D","D","D","D")
b <- c("x","y","z","x","x","z","y","z","z","z","y","z","z","z","x")
df = data.frame(a,b)


    a   b
1   A   x
2   A   y
3   A   z
4   B   x
5   B   x
6   B   z
7   C   y
8   C   z
9   C   z
10  C   z
11  D   y
12  D   z
13  D   z
14  D   z
15  D   x

For every group A, B, C, D, I'd like to delete the value z in column b every time the combination y,z appears at the end of the group.

If we have the case of a=="C", where the b-values are y,z,z,z, I'd like to delete all z's. However, in a=="D", nothing has to change as x is the last value.

The results looks like this:

    a   b
1   A   x
2   A   y
4   B   x
5   B   x
6   B   z
7   C   y
11  D   y
12  D   z
13  D   z
14  D   z
15  D   x

By grouping in dplyr, I can identify the last occurence of each value in A, so the basic case depictured in a=="A"is not a problem. I have trouble finding a solution for the case of a=="C", where I could have one occurence of y followed by 20 occurences of z.

Upvotes: 4

Answers (4)

David Arenburg

Reputation: 92282

Here's a possible data.table solution. Basically, I'm creating an logical index that satisfies 3 conditions at once: being a z, that the first z comes after y and that the last value is z and then I'm just evaluating it.

library(data.table)
setDT(df)[, indx := b == "z" & 
            max(which(b == "z")) == .N &
            ifelse(min(which(b == "z")) == 1L, 
                   TRUE, 
                   b[min(which(b == "z")) - 1L] == "y"),
            by = a][!(indx)]
#     a b  indx
#  1: A x FALSE
#  2: A y FALSE
#  3: B x FALSE
#  4: B x FALSE
#  5: B z FALSE
#  6: C y FALSE
#  7: D y FALSE
#  8: D z FALSE
#  9: D z FALSE
# 10: D z FALSE
# 11: D x FALSE

Upvotes: 3

Sven Hohenstein

Reputation: 81683

You can use by and cummin in base R:

df[unlist(by(df$b, interaction(df$a), FUN = function(x) {
  tmp <- rev(cummin(rev(x == "z")))
  if (tail(x[!tmp], 1) == "y") !tmp else rep(TRUE, length(x))
})), ]

The result:

   a b
1  A x
2  A y
4  B x
5  B x
6  B z
7  C y
11 D y
12 D z
13 D z
14 D z
15 D x

Upvotes: 4

statespace

Reputation: 1664

Not so efficient, but works fine:

require(stringr)
df2 <- data.frame(row.names = c("a", "b")) 

for(i in levels(factor(df$a))) {
  temp <- paste(df$b[df$a == i], collapse = "")
  if(str_detect(temp, "yz") & str_detect(temp, "z$")) {
    temp <- gsub("z", "", temp)
    df2 <- rbind(df2, data.frame(a = rep(i, nchar(temp)), b = substring(temp, seq(1,nchar(temp),1), seq(1,nchar(temp),1))))
  } else df2 <- rbind(df2, data.frame(a = rep(i, nchar(temp)), b = substring(temp, seq(1,nchar(temp),1), seq(1,nchar(temp),1))))   
}

#    a b
# 1  A x
# 2  A y
# 3  B x
# 4  B x
# 5  B z
# 6  C y
# 7  D y
# 8  D z
# 9  D z
# 10 D z
# 11 D x

Upvotes: 1

Thomas

Reputation: 44525

Here's a base solution:

do.call("rbind", by(df, df$a, FUN = function(x) {
    if(x$b[length(x$b)] == "z") {
        y <- which(x$b == "y")
        if(!length(y)) {
            return(x)
        }
        z <- which(x$b == "z")
        if(!length(z)) {
            return(x)
        }
        # check if y isn't immediately before z
        if(max(y) - min(z) > 1) {
            return(x)
        } else {
            return(x[-z,])
        }
    } else {
        return(x)
    }
}))

And the result:

     a b
A.1  A x
A.2  A y
B.4  B x
B.5  B x
B.6  B z
C    C y
D.11 D y
D.12 D z
D.13 D z
D.14 D z
D.15 D x

Upvotes: 1

Delete rows after a certain sequence of values in a certain column

Answers (4)

Related Questions