Reputation: 41
I need to assign numbers to sets of consecutive values in every column and create new columns. Eventually I want to find a sum of values in z column that correspond to the first consecutive numbers in each column.
My data looks something like this:
library(dplyr)
y1 = c(1,2,3,8,9,0)
y2 = c(0,0,0,4,5,6)
z = c(200,250,200,100,90,80)
yabc <- tibble(y1, y2, z)
# A tibble: 6 × 3
y1 y2 z
<dbl> <dbl> <dbl>
1 1 0 200
2 2 0 250
3 3 0 200
4 8 4 100
5 9 5 90
6 0 6 80
I tried the following formula:
yabc %>%
mutate_at(vars(starts_with("y")),
list(mod = ~ cumsum(c(FALSE, diff(.x)!=1))+1))
that gave me the following result:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 1
2 2 0 250 1 2
3 3 0 200 1 3
4 8 4 100 2 4
5 9 5 90 2 4
6 0 6 80 3 4
I am only interested in numbers greater than zero. I tried replacing zeros with NA
, but it did not work either.
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 200 1 1
2 2 NA 250 1 NA
3 3 NA 200 1 NA
4 8 4 100 2 NA
5 9 5 90 2 NA
6 NA 6 80 NA NA
What I would like the data to look like is:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
Is there any way to exclude zeros and start applying the formula only when .x
is greater than 0? Or any other way to make the formula work the way I need? Thank you!
Upvotes: 4
Views: 389
Reputation: 25528
Another possible solution:
library(tidyverse)
y1=c(1,2,3,8,9,0)
y2=c(0,0,0,4,5,6)
z=c(200,250,200,100,90,80)
yabc<-tibble(y1,y2,z)
yabc %>%
mutate(across(starts_with("y"),
~if_else(.x==0, NA_real_, 1+cumsum(c(1,diff(.x)) != 1)), .names="{.col}_mod"))%>%
mutate(across(ends_with("mod"), ~ factor(.x) %>% as.numeric(.)))
#> # A tibble: 6 × 5
#> y1 y2 z y1_mod y2_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 200 1 NA
#> 2 2 0 250 1 NA
#> 3 3 0 200 1 NA
#> 4 8 4 100 2 1
#> 5 9 5 90 2 1
#> 6 0 6 80 NA 1
Upvotes: 0
Reputation: 79338
library(data.table)
library(tidyverse)
yabc %>%
mutate(across(starts_with('y'),
~ as.integer(factor(`is.na<-`(rleid(.x - row_number()), !.x))),
.names = '{col}_mod'))
# A tibble: 6 x 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <int> <int>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
The trick lies in knowing that for consecutive numbers, the difference between the number and their row_number() is the same:
ie consider:
x <- c(1,2,3,6,7,8,10,11,12)
The consecutive numbers can be grouped as:
x - seq_along(x)
[1] 0 0 0 2 2 2 3 3 3
As you can see, the consecutive numbers are grouped together. To get the desired groups, we should use rle
rleid(x-seq_along(x))
[1] 1 1 1 2 2 2 3 3 3
Upvotes: 2
Reputation: 160962
FYI: mutate_at
has been superseded by across
, I'll demonstrate the new method in my code.
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)) + 1L,
NA_integer_) )
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 2
# 5 9 5 90 2 2
# 6 0 6 80 NA 2
If this is sufficient (you don't care if it's 1
or 2
for the first effective group in y2_mod
), then you're good. If you want to reduce them all to be 1-based, then
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)),
NA_integer_))),
across(ends_with("_mod"),
~ if_else(is.na(.x), .x, match(.x, na.omit(unique(.x))))
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 1
# 5 9 5 90 2 1
# 6 0 6 80 NA 1
Notes:
if_else
is helpful to handle the NA
-including rows specially; it requires the same class, which can be annoying/confusing. Because of this, we need to pass the specific "class" of NA
as the false=
(third) argument to if_else
. For example, cumsum(.)+1
produces a numeric
, so the third arg would need to be NA_real_
(since the default NA
is actually logical
). Another way to deal with it is to either use cumsum(.)+1L
(produces an integer) and NA_integer_
or (as I show in my second example) use cumsum(.)
by itself (and NA_integer_
) since we match
things later (and match(.)
returns integer
)mutate_at
to mutate(across(..))
. An important change here from mutate
is that we run across
without assigning its return to anything. In essence, it returns a named-list where each element of the list is an updated column or a new one, depending on the presence of .names
; that takes a glue
-like string to allow for renaming the calculated columns, thereby adding new columns instead of the default action (no .names
) of overwriting the columns in-place. The alternate way of producing new (not in-place) columns is the way you used, with a named list of functions, still a common/supported way to use a list of functions within across(..)
.Upvotes: 2