Reputation:

Select sub-dataframe between two values

I have a data frame object, where the first column is mass and the second is abundance.

dd <- read.table(text = "771.55 0
                  772.35 0 
                 772.9 10 
                 773.81 0
                 885.64 0 
                 885.65 110 
                 885.68 313 
                 885.70 313 
                 885.78 71 
                 885.82 0
                 889.12 0
                 889.13 506 
                 885.82 0
                 900.31 0 
                 900.34 10 
                 901.22 1901 
                 902.8 0
                 908.8 0")

I have to select only sub-dataframe with consecutive zero value (starting with 0 and ending with 0) in the second column with abundance (second column value) > 100. The result must be:

list1 <- read.table(text= "885.64 0
                          885.65 110 
                          885.68 313 
                          885.70 313 
                          885.78 71 
                          885.82 0")

list3 <- read.table(text= "889.12 0
                          889.13 506 
                          885.82 0")

...etc

Someone suggested this solution:

dd[!!ave(dd$V2, c(0, cumsum(diff(dd$V2) == 0)), FUN = function(x) any(x > 100)), ]

It works very well, but it cuted also when their is duplicate abundance value. Instead of cutting:

list <- read.table(text= "885.64 0 
             885.65 110 
             885.68 313 
             885.70 313 
             885.78 71 
             885.82 0")

It cuts wrongly in the middle of the serie:

list <- read.table(text= "885.64 0 
                 885.65 110 
                 885.68 313")

list <- read.table(text= "885.70 313 
                 885.78 71 
                 885.82 0")

Upvotes: 1

Answers (3)

jogo

Reputation: 12569

Here is a solution with data.table to construct a grouping variable:

library("data.table")
dt <- fread(
  "x y
771.55 0
772.35 0 
772.9 10 
773.81 0
885.64 0 
885.65 10 
885.68 313 
885.70 313 
885.78 71 
885.82 0
889.12 0
889.13 506 
885.82 0
900.31 0 
900.34 10 
901.22 1901 
902.8 0
908.8 0")
dt[, ':='(y2=shift(y), y3=shift(y, type="lead"))]
dt[, ':='(start=(y==0 & y3>0), stop=(y==0 & y2>0))]
dt[, group:=(rleid(start, stop)+1)%/%3]
dt[, if (.N>=3 && max(y)>100) .SD[, .(x, y)], group]
# > dt[, if (.N>=3 && max(y)>100) .SD[, .(x, y)], group]
#    group      x    y
# 1:     2 885.64    0
# 2:     2 885.65   10
# 3:     2 885.68  313
# 4:     2 885.70  313
# 5:     2 885.78   71
# 6:     2 885.82    0
# 7:     3 889.12    0
# 8:     3 889.13  506
# 9:     3 885.82    0
# 10:    4 900.31    0
# 11:    4 900.34   10
# 12:    4 901.22 1901
# 13:    4 902.80    0

Here is a short variant:

dt[, group:=rleidv(y==0 & shift(y)==0) %/%2][, if (.N>2 && max(y)>100) .SD, group]

Upvotes: 1

Uwe

Reputation: 42564

Here is another data.table solution which uses a non-equi join and groups within the join:

library(data.table)
# coerce to data.table and append row numbers
setDT(dd)[, rn := .I]
# find start and end indices of subsequences from zero to zero
mdt <- dd[, {tmp = .I[V2 == 0]; .(beg = head(tmp, -1L), end = tail(tmp, -1L))}]
# non-equi join of index ranges and group within the join 
# to return only subsequences which fulfill the condition
result <- dd[mdt, on = .(rn >= beg, rn <= end), .SD[any(V2 > 100)], by = .EACHI][
  # return mass, abundance, and group id
  , .(V1, V2, rleid(rn))]

result

        V1   V2 V3
 1: 885.64    0  1
 2: 885.65  110  1
 3: 885.68  313  1
 4: 885.70  313  1
 5: 885.78   71  1
 6: 885.82    0  1
 7: 889.12    0  2
 8: 889.13  506  2
 9: 885.82    0  2
10: 900.31    0  3
11: 900.34   10  3
12: 901.22 1901  3
13: 902.80    0  3

The grouping variable V3 should be sufficient for further group-wise processing. However, if it is required to separate the sub-data tables:

split(result, by = "V3")

$`1`
       V1  V2 V3
1: 885.64   0  1
2: 885.65 110  1
3: 885.68 313  1
4: 885.70 313  1
5: 885.78  71  1
6: 885.82   0  1

$`2`
       V1  V2 V3
1: 889.12   0  2
2: 889.13 506  2
3: 885.82   0  2

$`3`
       V1   V2 V3
1: 900.31    0  3
2: 900.34   10  3
3: 901.22 1901  3
4: 902.80    0  3

Upvotes: 1

Onyambu

Reputation: 79288

 library(data.table) 
 A=setDT(dd)[,group:=cumsum(c(diff(as.numeric(!V2)),0)<0)][,
              b:=any(V2>100),by=group][!!b][,b:=NULL]

 split(A,A$group)
$`2`
       V1  V2 group
1: 885.64   0     2
2: 885.65 110     2
3: 885.68 313     2
4: 885.70 313     2
5: 885.78  71     2
6: 885.82   0     2

$`3`
       V1  V2 group
1: 889.12   0     3
2: 889.13 506     3
3: 885.82   0     3

$`4`
       V1   V2 group
1: 900.31    0     4
2: 900.34   10     4
3: 901.22 1901     4
4: 902.80    0     4
5: 908.80    0     4

Upvotes: 1

Select sub-dataframe between two values

Answers (3)

Related Questions