Matthew Holland
Matthew Holland

Reputation: 15

Subset dataframe with inclusive cutoff rather than exclusive

This should be a fairly simple solution but I haven't for the life of me been able to figure out how to solve.

I have a column containing cumulative sums of proportions based on a certain combination of factor levels for my data,

eg.: 0.34, 0.57, 0.66, 0.68, 0.73, 0.81, 0.90, 0.97, 0.98, 0.99,

then the data moves to the next combination of factors and resets the cumulative sum,

eg.: 0.27, 0.29, 0.38, 0.56, 0.78, 0.94, 0.96, 0.97, 0.98 etc.

I calculated this with the following, for combinations of levels of two factors:

DF$CumSum <-ave(DF$Proportion, DF$Factor1, DF$Factor2, FUN = "cumsum")

I want to set a cutoff to subset my data at 0.95 so that I exclude the rows for the increasingly small proportions past this threshold, since the data was arranged in order of proportion before being cumulatively summed. The problem is if I have a set of proportions such as the following,

eg.: 0.58, 0.97, 0.98, 0.99,

then by thresholding at 0.95 I am excluding, in this case 0.39 of proportions, which is a lot, and I would be left with,

eg.: 0.58

Although 0.97 exceeds the threshold (0.95), the previous row is below the threshold (0.58). I would expect any subsequent rows after the first row above 0.95 for the factor level combination to be excluded.

Is there a way to use a different subsetting technique so that in cases like this, the data would be subsetted inclusive of this value? eg. the output would be: 0.58, 0.97 without increasing my cutoff threshold. I have tried implementing an ifelse statement inside of a for loop without success. This is what I came up with:

Threshold <- 0.95

for(i in 2:length(DF$Index)) {

    DF[i] <- ifelse(DF$CumSum[i-1] < Threshold & DF$CumSum[i] >= Threshold, 
    DF[-i,], DF[i])

  }

Example data:

Index   Factor1 Factor2 Proportion  CumSum
 1        A        A       0.3       0.3
 2        A        A       0.2       0.5
 3        A        A       0.1       0.6
 4        A        A       0.05      0.65
 5        A        A       0.03      0.68
 6        A        A       0.01      0.69
 7        A        B       0.4       0.4
 8        A        B       0.3       0.7
 9        A        B       0.2       0.9
 10       A        B       0.05      0.95
 11       A        B       0.03      0.98
 12       A        B       0.002     0.982
 13       A        B       0.001     0.983
 14       B        A       0.35      0.35
 15       B        A       0.2       0.55
 16       B        A       0.12      0.67
 17       B        A       0.06      0.73
 18       B        A       0.04      0.77
 19       B        B       0.6       0.6
 20       B        B       0.3       0.9
 21       B        B       0.08      0.98

Upvotes: 1

Views: 428

Answers (1)

chinsoon12
chinsoon12

Reputation: 25225

finally understood that you only want to exclude if prev value is more than 0.95. the below should work:

DF$CumSum <- ave(DF$Proportion, DF$Factor1, DF$Factor2, FUN = "cumsum")
do.call(rbind, by(DF, paste(DF$Factor1, DF$Factor2), function(subDF) {
    subDF[c(TRUE, (dplyr::lag(subDF$CumSum) <= 0.95)[-1]),]
}))

#       Index Factor1 Factor2 Proportion CumSum
#A A.1      1       A       A       0.30   0.30
#A A.2      2       A       A       0.20   0.50
#A A.3      3       A       A       0.10   0.60
#A A.4      4       A       A       0.05   0.65
#A A.5      5       A       A       0.03   0.68
#A A.6      6       A       A       0.01   0.69
#A B.7      7       A       B       0.40   0.40
#A B.8      8       A       B       0.30   0.70
#A B.9      9       A       B       0.20   0.90
#A B.10    10       A       B       0.05   0.95
#B A.14    14       B       A       0.35   0.35
#B A.15    15       B       A       0.20   0.55
#B A.16    16       B       A       0.12   0.67
#B A.17    17       B       A       0.06   0.73
#B A.18    18       B       A       0.04   0.77
#B B.19    19       B       B       0.60   0.60
#B B.20    20       B       B       0.30   0.90
#B B.21    21       B       B       0.08   0.98

data:

Df <- read.table(text="Index   Factor1 Factor2 Proportion  CumSum
1        A        A       0.3       0.3
2        A        A       0.2       0.5
3        A        A       0.1       0.6
4        A        A       0.05      0.65
5        A        A       0.03      0.68
6        A        A       0.01      0.69
7        A        B       0.4       0.4
8        A        B       0.3       0.7
9        A        B       0.2       0.9
10       A        B       0.05      0.95
11       A        B       0.03      0.98
12       A        B       0.002     0.982
13       A        B       0.001     0.983
14       B        A       0.35      0.35
15       B        A       0.2       0.55
16       B        A       0.12      0.67
17       B        A       0.06      0.73
18       B        A       0.04      0.77
19       B        B       0.6       0.6
20       B        B       0.3       0.9
21       B        B       0.08      0.98", header=TRUE)

Upvotes: 0

Related Questions