Sebastian Zeki
Sebastian Zeki

Reputation: 6874

How to remove row based on condition of row above or below

I have a dataframe as follows:

     chr   leftPos        ZScore1    ZScore2    ZScore3    ZScore4
      1     24352           34         43          19         43
      1     53534           2          1           -1         -9
      2      34            -15         7           -9         -18
      3     3443           -100        -4          4          -9
      3     3445           -100        -1          6          -1
      3     3667            5          -5          9           5
      3      7882          -8          -9          1           3

I would like to only keep those rows which have the same chr and have adjacent columns with the ZScore going in the same direction. In other words a row should remain if the row before or after for that chr has the same sign (positive or negative). I would like this to run for all columns with ZS in the column name so that the output ends up just being a number of rows that fulfil the criteria for each row.

For one column the code should result in:

     chr   leftPos         ZScore
      1     24352           34
      1     53534           2
      3     3443           -100
      3     3445           -100

but the final output should look like

         ZScore1    ZScore2    ZScore3    ZScore4
nrow        4         6          4          4


 I have tried bits of code but Im not even really sure how to approach this.

I guess I would group by chr and then see if the row above was the same positive or negative as the current row and then see if the row below was the same direction as the current row. Then move to the next row for that chr.

Upvotes: 5

Views: 3320

Answers (4)

akrun
akrun

Reputation: 887058

An option using the devel version of data.table (similar to the approach in @dimitris_ps post). Instructions to install the devel version are here

library(data.table)#v1.9.5
na.omit(setDT(df)[, {tmp= sign(ZScore)
  .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')] },
             by=chr])
#     chr leftPos ZScore
#1:   1   24352     34
#2:   1   53534      2
#3:   3    3443   -100
#4:   3    3445   -100

Update

We could create a function

 f1 <- function(dat, ZCol){
    na.omit(as.data.table(dat)[, {tmp = sign(eval(as.name(ZCol)))
     .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]},
    by=chr])[, list(.N)]}

 nm1 <- paste0('ZScore', 1:4)
 setnames(do.call(cbind,lapply(nm1, function(x) f1(df1, x))), nm1)[]
 #   ZScore1 ZScore2 ZScore3 ZScore4
 #1:       4       6       4       4

Or we can use set

 res <- as.data.table(matrix(0, ncol=4, nrow=1, 
                  dimnames=list(NULL, nm1)))
 for(j in seq_along(nm1)){
   set(res, i=NULL, j=j, value=f1(df1,nm1[j]))
  }
 res
 #   ZScore1 ZScore2 ZScore3 ZScore4
 #1:       4       6       4       4

data

df <- structure(list(chr = c(1L, 1L, 2L, 3L, 3L, 3L, 3L),
leftPos = c(24352L, 
53534L, 34L, 3443L, 3445L, 3667L, 7882L), ZScore = c(34L, 2L, 
-15L, -100L, -100L, 5L, -8L)), .Names = c("chr", "leftPos", "ZScore"
), class = "data.frame", row.names = c(NA, -7L))

 df1 <- structure(list(chr = c(1L, 1L, 2L, 3L, 3L, 3L, 3L),
 leftPos = c(24352L, 
 53534L, 34L, 3443L, 3445L, 3667L, 7882L), ZScore1 = c(34L, 2L, 
 -15L, -100L, -100L, 5L, -8L), ZScore2 = c(43L, 1L, 7L, -4L, -1L, 
 -5L, -9L), ZScore3 = c(19L, -1L, -9L, 4L, 6L, 9L, 1L),
 ZScore4 = c(43L, 
 -9L, -18L, -9L, -1L, 5L, 3L)), .Names = c("chr", "leftPos",
  "ZScore1", "ZScore2", "ZScore3", "ZScore4"), class = "data.frame",
  row.names = c(NA, -7L))

Upvotes: 3

David Arenburg
David Arenburg

Reputation: 92282

Here's a possible data.table solution which uses rleid from the dev version

setDT(df)[, indx := .N, by = .(chr, rleid(sign(ZScore)))][indx > 1L]
#    chr leftPos ZScore indx
# 1:   1   24352     34    2
# 2:   1   53534      2    2
# 3:   3    3443   -100    2
# 4:   3    3445   -100    2

Edit (per new data)

indx <- paste0('ZScore', 1:4)
temp <- setDT(df)[, lapply(.SD, function(x) rleid(sign(x))), .SDcols = indx, by = chr]

Res <- setNames(numeric(length(indx)), indx)
for (i in indx) Res[i] <- length(temp[, .I[.N > 1L], by = c("chr", i)]$V1)
Res
# ZScore1 ZScore2 ZScore3 ZScore4 
#       4       6       4       4 

Upvotes: 3

Mark
Mark

Reputation: 4537

This is a loop which does what you want. No fancy packages. Just check if the row behind and the row ahead match - if they do, move on, otherwise, strip the row and check the same position.

chr = c(1,1,2,3,3,3,3,3)
mat = cbind(chr,rnorm(8))

i = 1
while(i <= nrow(mat)){
  if (mat[max(i-1,1),1] != mat[i,1] & mat[min(nrow(mat),i+1),1] != mat[i,1]){
    mat = mat[-i,]
  } else {
    i = i+1
  }
}

Upvotes: 0

dimitris_ps
dimitris_ps

Reputation: 5951

Try this with the package dplyr

library(dplyr)

Data

df <- data.frame(chr=c(1, 1, 2, 3, 3, 3, 3),
             leftPos=c(24352, 53534, 34, 3443, 3445, 3667, 7882),
             ZScore=c(34, 2, -15, -100, -100, 5, -8))

Code

df %>% group_by(chr) %>% 
   filter(sign(ZScore)==sign(lag(ZScore)) | sign(ZScore)==sign(lead(ZScore))) %>% 
   ungroup

Upvotes: 7

Related Questions