Reputation: 312
In my reprex below:
RSA
is the output of a process that is to be analyzed and whose results is to be grouped.RSA
group has varying range of days (datenum
) each is observed.var1
varies less frequently, but each is observed for the same 8 days consecutively.RSA
groups are to be numbered sequentially within the var1
group; when a new var1
is encountered the RSA
group numbering begins anew.idx_objective
is the index that I am looking for.Reprex:
var1 <- c("aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd")
RSA <- c(1,1,1,0,-1,-1,0,-1,
0,0,0,-1,-1,-1,1,1,
-1,-1,0,1,1,-1,-1,1,
1,-1,-1,1,1,0,-1,1)
idx_objective <- c(1,1,1,2,3,3,4,5,
1,1,1,2,2,2,3,3,
1,1,2,3,3,4,4,5,
1,2,2,3,3,4,5,6)
objective.df <- data.frame(var1, RSA, idx_objective) %>%
group_by(var1) %>%
mutate (datenum = 1:n()) %>%
relocate (datenum, .after = var1)
I have reviewed many SO posts that appear to be similar...
1dplyr: group variables then assign unique names based on unique grouping
revolves around correct use of cumsum, which I think I am using correctly
[https://stackoverflow.com/questions/40519129/how-to-assign-unique-id-for-group-of-duplicates]
[2]How to divide between groups of rows using dplyr
The last two don't seem applicable; two others referenced in the following:
Approach #1: using a change flag and cumsum
objective.try1 <- objective.df %>%
group_by(var1) %>%
mutate(chg_flg = ifelse(lag(RSA) != RSA, 1, 0) %>%
coalesce(0)) %>%
relocate(chg_flg, .after = RSA) %>%
relocate (datenum, .after var1) %>%
group_by(var1, chg_flg) %>%
mutate (idx_objective_try = cumsum(chg_flg) +1) %>%
Results:
objective.try1 <- c(1, 1, 1, 2, 3, 1, 4, 5, 1, 1, 1, 2, 1, 1, 3, 1, 1, 2, 3, 1, 4, 1, 5, 1, 2, 1, 3, 1, 4, 5, 6)
objective.df <- data.frame(var1, RSA, idx_objective, objective.try1 %>%
group_by(var1) %>%
mutate (datenum = 1: n()) %>%
relocate(datenum, .after = var1)
Observation for objective.try1
: rows 1-5 work, but row 6 incorrectly restarts the idx
numbering over again, but then resumes correctly reflecting the chg_flg
until rows 13 and 14 at which time the idx
numbering is again incorrectly restarted, but then resumes again being correct for one row until being incorrect again at rows 16, 21, 23, 27, and 29.
Following the logic at row 6, for example -- the previous idx_objective_try
(row 5) is 3 and the chg_flg
value at row 6 is zero, so the idx_objecitve_try
ought to be the correct value of 3. Why isn't it?
Approach #2: Using match
and duplicated
:
objective.try2 <- objective.df %>%
group_by(var1) %>%. # var1 corresponds to "prop" in the SO post (both the slower moving variables)
mutate(well_rep1 = match(RSA, unique(RSA)), # "RSA" corresponds to "well" in the SO post (both the faster changing variables)
well_rep2 = cumsum(!duplicated(RSA))) # approach similar to above
Observation for objective.try2
: most rows work, but there again are rows that do not work, though the rows that don't work are different from those in the first try.
I would appreciate it if someone would point out what I am doing wrong.
Upvotes: 1
Views: 284
Reputation: 30474
Focusing on the first group in var1
which is "aaa", you have the following column data for RSA
:
objective.df[objective.df$var1 == "aaa", "RSA"]
RSA
<dbl>
1 1
2 1
3 1
4 0
5 -1
6 -1
7 0
8 -1
If you use the diff
function on this to get differences between one value and the next, you will get 7 differences for the 8 data elements in RSA
for "aaa". In other words, if you have a length "n" vector you will get back "n-1" values in the end (without changing arguments).
Since we want to add a new column new_id_objective
, we will need to include 8 values for this group, not 7. The combine c
will help create the vector for the new column, combining an initial number with the 7 differences from diff
, for a total of 8.
So for "aaa", the diff
will return 0, 0, -1, -1, 0, 1, -1. These values are compared with 0 (!= 0
), with the result of the evaluation being either TRUE or FALSE. The cumsum
of TRUE will increment the counter. In other words, the value of new_id_objective
will increase row-by-row when there is a difference between consecutive row values (up or down, or anything but zero).
In this example, we used combine c
with an initial value of 1. Note that we can use any number that is not zero to start with (while in this case we used 1, we could have used 5, 100, -10, or anything not zero). By doing this, the first evaluation will be TRUE (not equal to zero) and the new_id_objective
counter value will start with 0 + 1 = 1. Otherwise, if you did try to use zero, i.e., c(0, diff(RSA)) != 0
, this will evaluate to FALSE, and the cumulative sum would start with 0, not 1.
Let me know if this helps.
library(tidyverse)
objective.df %>%
group_by(var1) %>%
mutate(new_id_objective = cumsum(c(1, diff(RSA)) != 0))
Output
var1 datenum RSA idx_objective new_id_objective
<chr> <int> <dbl> <dbl> <int>
1 aaa 1 1 1 1
2 aaa 2 1 1 1
3 aaa 3 1 1 1
4 aaa 4 0 2 2
5 aaa 5 -1 3 3
6 aaa 6 -1 3 3
7 aaa 7 0 4 4
8 aaa 8 -1 5 5
9 bbb 1 0 1 1
10 bbb 2 0 1 1
# … with 22 more rows
Upvotes: 1