Reputation: 5701
Let's say I have something like this:
set.seed(0)
the.df <- data.frame(x = rep(letters[1:3], each = 4),
                     n = rep(0:3, 3),
                     val = round(runif(12)))
the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
Within each x, starting from n==2 (going from small to large), I want to set val to 0 if the previous val (in terms of n) is 0; otherwise, leave it as is.
For example, in the subset x=="b", I first ignore the two rows where n < 2. Now, in Row 7, because the previous val is 0 (the.df$val[the.df$x=="b" & the.df$n==1]), I set val to 0 (the.df$val[the.df$x=="b" & the.df$n==2] <- 0). Then on Row 8, now that val for the previous n is 0 (we just set it), I also want to set val here to 0 (the.df$val[the.df$x=="b" & the.df$n==3] <- 0).
Imagine that the data.frame is not sorted, so procedures that depend on row order would require a sort first. I also can't assume that adjacent rows exist (e.g., the row the.df[the.df$x=="a" & the.df$n==1, ] might be missing). The trickiest part seems to be evaluating val in sequence. I can do this using a loop (sketched after the wanted output below), but I imagine it would be inefficient (I have millions of rows). Is there a way I can do this more efficiently?
EDIT: wanted output
the.df
x n val wanted
1 a 0 1 1
2 a 1 0 0
3 a 2 0 0
4 a 3 1 0
5 b 0 1 1
6 b 1 0 0
7 b 2 1 0
8 b 3 1 0
9 c 0 1 1
10 c 1 1 1
11 c 2 0 0
12 c 3 0 0
Also, I don't mind making new columns (e.g., putting the wanted values there).
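For reference, here is a minimal sketch of the loop approach the question mentions (names are illustrative; it assumes the data have first been sorted by x and n):

the.df <- the.df[order(the.df$x, the.df$n), ]
for (i in seq_len(nrow(the.df))[-1]) {
    # same group as the previous row, at or past n == 2,
    # and the previous (possibly just-updated) val is 0
    if (the.df$x[i] == the.df$x[i - 1] &&
        the.df$n[i] >= 2 &&
        the.df$val[i - 1] == 0) {
        the.df$val[i] <- 0
    }
}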
Upvotes: 7
Views: 512
Reputation: 46856
A base R approach might be
df <- the.df[order(the.df$x, the.df$n),]
df$val <- ave(df$val, df$x, FUN=fun)
As for fun, @DavidArenburg's answer in plain R, written a bit more poetically, might be
fun0 <- function(v) {
    idx <- which.max(v[2:length(v)] == 0L) + 1L
    if (length(idx))
        v[idx:length(v)] <- 0L
    v
}
It seems like a good idea to formulate the solution as an independent function first, because then it is easy to test. fun0 fails for some edge cases, e.g.,
> fun0(0)
[1] 0 0 0
> fun0(1)
[1] 0 0 0
> fun0(c(1, 1))
[1] 1 0
A better version is
fun1 <- function(v) {
    tst <- tail(v, -1) == 0L
    if (any(tst)) {
        idx <- which.max(tst) + 1L
        v[idx:length(v)] <- 0L
    }
    v
}
And even better, following @Arun
fun <- function(v)
    if (length(v) > 2) c(v[1], cummin(v[-1])) else v
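A quick check that fun handles the edge cases that tripped up fun0:

> fun(0)
[1] 0
> fun(1)
[1] 1
> fun(c(1, 1))
[1] 1 1
> fun(c(1, 0, 1, 1))   # the x == "b" group
[1] 1 0 0 0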
This is competitive (the same order of magnitude) with the data.table solution, with ordering and return occurring in less than 1 s for the ~10M-row data.frame of @m-dz's timings. At about a second for millions of rows, it doesn't seem worthwhile to pursue further optimization.
Nonetheless, when there are very many small groups (e.g., 2M groups of size 5), an improvement is to avoid the tapply() function call by using group identity to offset the minimum: because consecutive groups are offset by exactly 1 while val is either 0 or 1, a single running cummin() can never carry a zero across a group boundary. For instance,
df <- df[order(df$x, df$n), ]
grp <- match(df$x, unique(df$x))   # strictly sequential group ids 1, 2, ...
keep <- duplicated(grp)            # ignore the first row of each group
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
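As a toy check of the offset trick (made-up two-group data), the trailing zero of group "a" must not leak into group "b":

df <- data.frame(x = rep(c("a", "b"), each = 3),
                 n = rep(0:2, 2),
                 val = c(1, 0, 1, 1, 1, 1))
grp <- match(df$x, unique(df$x))
keep <- duplicated(grp)
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
df$val
# [1] 1 0 0 1 1 1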
Upvotes: 3
Reputation: 92282
Using data.table, I would try the following
library(data.table)
setDT(the.df)[order(n),
              val := if (length(indx <- which(val[2:.N] == 0L)))
                  c(val[1:(indx[1L] + 1L)], rep(0L, .N - (indx[1L] + 1L))),
              by = x]
the.df
# x n val
# 1: a 0 1
# 2: a 1 0
# 3: a 2 0
# 4: a 3 0
# 5: b 0 1
# 6: b 1 0
# 7: b 2 0
# 8: b 3 0
# 9: c 0 1
# 10: c 1 1
# 11: c 2 0
# 12: c 3 0
This will simultaneously order the data by n (as you said it isn't ordered in real life) and recreate val by condition (meaning that if the condition is not satisfied, val is left untouched).
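For comparison, the cummin() formulation from the base R answer above translates directly into data.table syntax (a sketch, checked only against the sample data):

setDT(the.df)[order(n),
              val := if (.N > 2) c(val[1], cummin(val[-1])) else val,
              by = x]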
Hopefully in the near future data.table will support this kind of sub-assignment in j, and then the code could potentially be
setDT(the.df)[order(n), val[n > 2] := if (val[2L] == 0) 0L, by = x]
which would be a great improvement both performance- and syntax-wise.
Upvotes: 6
Reputation: 2362
Hmmm, it should be pretty efficient if you switch to data.table...
library(data.table)
# Define the.df as a data.table (or use data.table::setDT())
set.seed(0)
the.df <- data.table(
    x = rep(letters[1:3], each = 4),
    n = rep(0:3, 3),
    val = round(runif(12))
)
m_dz <- function() {
    setorder(the.df, x, n)
    repeat {
        # Get IDs of rows to change
        # ids <- which(the.df[, (n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0)])
        ids <- the.df[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
        # If there are no IDs left, we are done
        if (length(ids) == 0) {
            break
        }
        # Set val to 0 by reference
        # for (i in ids) set(the.df, i = i, j = "val", value = 0)
        set(the.df, i = ids, j = "val", value = 0)
    }
    return(the.df)
}
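Called on the 12-row example this reproduces the wanted values; note the repeat loop needs two passes for group b, where the zero has to propagate from n == 2 on to n == 3:

res <- m_dz()
res$val
# [1] 1 0 0 0 1 0 0 0 1 1 0 0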
EDIT: I switched to which = TRUE and set(the.df, i = ids, j = "val", value = 0) (the commented-out lines above show the earlier versions), which made the timings much more stable (no very high max timings). m_dz() updated accordingly (@FoldedChromatin's answer skipped because of different results). My function is slightly faster in terms of median and upper quartile, but there is quite a big spread in the timings (see the max column); I cannot figure out why. Hopefully the timing methodology is correct (returning the result to a different object etc.).
Anything bigger will kill my PC :(
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace = TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e7 / size1)
the.df1 <- data.table(
    x = rep(groups_ids, each = size2),   # ~300 groups of ~33k rows each, ~1e7 rows total
    n = rep(0:(size2 - 1), size1),
    val = round(runif(size1 * size2))
)
the.df2 <- copy(the.df1)
# m-dz
m_dz <- function() {
    setorder(the.df1, x, n)
    repeat {
        ids <- the.df1[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
        if (length(ids) == 0) {
            break
        }
        set(the.df1, i = ids, j = "val", value = 0)
    }
    return(the.df1)
}
# David Arenburg
DavidArenburg <- function() {
    setorder(the.df2, x, n)
    the.df2[, val := if (length(indx <- which.max(val[2:.N] == 0) + 1L))
                c(val[1:indx], rep(0L, .N - indx)),
            by = x]
    return(the.df2)
}
library(microbenchmark)
microbenchmark(
    res1 <- m_dz(),
    res2 <- DavidArenburg(),
    times = 100
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# res1 <- m_dz() 247.4136 268.5005 363.0117 288.4216 312.7307 7071.0960 100 a
# res2 <- DavidArenburg() 270.6074 281.3935 314.7864 303.5229 328.1210 525.8095 100 a
identical(res1, res2)
# [1] TRUE
Scaling the same setup up to ~1e8 rows:
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace = TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e8 / size1)
# Unit: seconds
# expr min lq mean median uq max neval cld
# res1 <- m_dz() 5.599855 5.800264 8.773817 5.923721 6.021132 289.85107 100 a
# res2 <- m_dz2() 5.571911 5.836191 9.047958 5.970952 6.123419 310.65280 100 a
# res3 <- DavidArenburg() 9.183145 9.519756 9.714105 9.723325 9.918377 10.28965 100 a
Upvotes: 2
Reputation: 217
Why not just use by?
> set.seed(0)
> the.df <- data.frame(x = rep(letters[1:3], each = 4),
                       n = rep(0:3, 3),
                       val = round(runif(12)))
> the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
> Mod.df <- by(the.df, INDICES = the.df$x, function(x) {
      x$val[x$n == 2] <- 0
      Which <- which(x$n == 2 & x$val == 0) + 1
      x$val[Which] <- 0
      x
  })
> do.call(rbind,Mod.df)
x n val
a.1 a 0 1
a.2 a 1 0
a.3 a 2 0
a.4 a 3 0
b.5 b 0 1
b.6 b 1 0
b.7 b 2 0
b.8 b 3 0
c.9 c 0 1
c.10 c 1 1
c.11 c 2 0
c.12 c 3 0
Upvotes: 0