Reputation: 15
This should be a fairly simple solution but I haven't for the life of me been able to figure out how to solve.
I have a column containing cumulative sums of proportions based on a certain combination of factor levels for my data,
eg.: 0.34, 0.57, 0.66, 0.68, 0.73, 0.81, 0.90, 0.97, 0.98, 0.99,
then the data moves to the next combination of factors and resets the cumulative sum,
eg.: 0.27, 0.29, 0.38, 0.56, 0.78, 0.94, 0.96, 0.97, 0.98 etc.
I calculated this with the following, for combinations of levels of two factors:
DF$CumSum <-ave(DF$Proportion, DF$Factor1, DF$Factor2, FUN = "cumsum")
I want to set a cutoff to subset my data at 0.95 so that I exclude the rows for the increasingly small proportions past this threshold, since the data was arranged in order of proportion before being cumulatively summed. The problem is if I have a set of proportions such as the following,
eg.: 0.58, 0.97, 0.98, 0.99,
then by thresholding at 0.95 I am excluding, in this case 0.39 of proportions, which is a lot, and I would be left with,
eg.: 0.58
Although 0.97 exceeds the threshold (0.95), the previous row is below the threshold (0.58). I would expect any subsequent rows after the first row above 0.95 for the factor level combination to be excluded.
Is there a way to use a different subsetting technique so that in cases like this, the data would be subsetted inclusive of this value? eg. the output would be: 0.58, 0.97 without increasing my cutoff threshold. I have tried implementing an ifelse statement inside of a for loop without success. This is what I came up with:
Threshold <- 0.95
for(i in 2:length(DF$Index)) {
DF[i] <- ifelse(DF$CumSum[i-1] < Threshold & DF$CumSum[i] >= Threshold,
DF[-i,], DF[i])
}
Example data:
Index Factor1 Factor2 Proportion CumSum
1 A A 0.3 0.3
2 A A 0.2 0.5
3 A A 0.1 0.6
4 A A 0.05 0.65
5 A A 0.03 0.68
6 A A 0.01 0.69
7 A B 0.4 0.4
8 A B 0.3 0.7
9 A B 0.2 0.9
10 A B 0.05 0.95
11 A B 0.03 0.98
12 A B 0.002 0.982
13 A B 0.001 0.983
14 B A 0.35 0.35
15 B A 0.2 0.55
16 B A 0.12 0.67
17 B A 0.06 0.73
18 B A 0.04 0.77
19 B B 0.6 0.6
20 B B 0.3 0.9
21 B B 0.08 0.98
Upvotes: 1
Views: 428
Reputation: 25225
finally understood that you only want to exclude if prev value is more than 0.95. the below should work:
DF$CumSum <- ave(DF$Proportion, DF$Factor1, DF$Factor2, FUN = "cumsum")
do.call(rbind, by(DF, paste(DF$Factor1, DF$Factor2), function(subDF) {
subDF[c(TRUE, (dplyr::lag(subDF$CumSum) <= 0.95)[-1]),]
}))
# Index Factor1 Factor2 Proportion CumSum
#A A.1 1 A A 0.30 0.30
#A A.2 2 A A 0.20 0.50
#A A.3 3 A A 0.10 0.60
#A A.4 4 A A 0.05 0.65
#A A.5 5 A A 0.03 0.68
#A A.6 6 A A 0.01 0.69
#A B.7 7 A B 0.40 0.40
#A B.8 8 A B 0.30 0.70
#A B.9 9 A B 0.20 0.90
#A B.10 10 A B 0.05 0.95
#B A.14 14 B A 0.35 0.35
#B A.15 15 B A 0.20 0.55
#B A.16 16 B A 0.12 0.67
#B A.17 17 B A 0.06 0.73
#B A.18 18 B A 0.04 0.77
#B B.19 19 B B 0.60 0.60
#B B.20 20 B B 0.30 0.90
#B B.21 21 B B 0.08 0.98
Df <- read.table(text="Index Factor1 Factor2 Proportion CumSum
1 A A 0.3 0.3
2 A A 0.2 0.5
3 A A 0.1 0.6
4 A A 0.05 0.65
5 A A 0.03 0.68
6 A A 0.01 0.69
7 A B 0.4 0.4
8 A B 0.3 0.7
9 A B 0.2 0.9
10 A B 0.05 0.95
11 A B 0.03 0.98
12 A B 0.002 0.982
13 A B 0.001 0.983
14 B A 0.35 0.35
15 B A 0.2 0.55
16 B A 0.12 0.67
17 B A 0.06 0.73
18 B A 0.04 0.77
19 B B 0.6 0.6
20 B B 0.3 0.9
21 B B 0.08 0.98", header=TRUE)
Upvotes: 0