Reputation: 13
library(tidyverse)
library(magrittr)
df <- data.frame(year = c(1977:1981), set852 = c(1,1,0,0,0), set857=c(0,0,1,1,0), set874=c(0,0,0,1,1))
For each variable set852, set857 and so forth (in the real datasets it's a long list) I want to create a variable that indicates whether there is a change in the time series (values would be "start", "end" and "no change"). The additional variables should look like this:
df_final <- data.frame(year = c(1977:1981), c852 = c("start","end","no change","no change","no change"), c857=c("no change","no change","start","end","no change"), c874=c("no change","no change","no change","start","end"))
I tried this within the tidyverse with a for-loop, mutate, paste and case_when:
set_num <- as.integer(str_extract(colnames(df), "[0-9]+"))
for (i in 2:nrow(df))
{
df %<>% mutate(paste0("c", set_num[[i]]) = case_when(paste("set", set_num[[i]], sep="")==1 & year == 1977 ~ "start",
paste("set", set_num[[i]], sep="")==1 & lag(paste("set", set_num[[i]], sep=""))==0 ~ "start",
paste("set", set_num[[i]], sep="")==1 & lead(paste("set", set_num[[i]], sep=""))==0 ~ "end",
TRUE~"no change"))
}
However, the paste0 call on the left-hand side inside mutate is not evaluated as a function; it is treated as the name of a variable that starts with paste0("c", ... and so forth. How do I get the code to evaluate the paste0 call as a function rather than treating it as a literal name?
Edit: There seems to be confusion about what constitutes a change. A sequence of 1-1-1-0-0 would be start-nochange-end-nochange-nochange
Upvotes: 1
Views: 82
Reputation: 72919
You could use matrixStats::rowCumsums, which takes cumulative sums along each row. The advantage is that the row calculations are done in compiled code, which is much faster. We take the result modulo (length(v) - 1) with %%, add one, and replace the positions where the data is 0 with length(v); the result is used to subset our value vector v. Finally, we inject an array with the original dimensions into our data frame. Using more interesting data:
> v <- c('end', 'start', 'no change')
> l <- length(v)
> df[-1] <- array(v[
+ replace(
+ matrixStats::rowCumsums(as.matrix(df[-1])) %% (l - 1) + 1,
+ df[-1] == 0,
+ l)
+ ], dim=dim(df[-1]))
> df
year set852 set857 set874 set852.1 set853 set8744
1 1977 start no change no change end no change no change
2 1978 start no change no change end no change no change
3 1979 no change start no change no change end no change
4 1980 no change start end no change start end
5 1981 no change no change start no change no change end
Reviewing the other answers, there seems to be confusion about your logic. I assumed that 1 indicates a change; accordingly, a row such as 0-1-0-1-0 should be "no change"-"start"-"no change"-"end"-"no change".
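The index arithmetic can be traced on a single vector. The following is a minimal sketch in base R, using cumsum in place of matrixStats::rowCumsums and the 0-1-0-1-0 row from the text; the names x, cs and idx are introduced here for illustration only.

```r
# Label vector in the same order as in the answer:
# index 1 = "end", 2 = "start", 3 = "no change".
v <- c("end", "start", "no change")
l <- length(v)

x <- c(0, 1, 0, 1, 0)      # the example row from the text

cs  <- cumsum(x)           # 0 1 1 2 2: running count of 1s
idx <- cs %% (l - 1) + 1   # 1 2 2 1 1: odd counts map to 2, even counts to 1
idx[x == 0] <- l           # 3 2 3 1 3: positions holding 0 become "no change"

v[idx]
# "no change" "start" "no change" "end" "no change"
```

Under this reading, every odd-numbered 1 in a row is labelled "start" and every even-numbered 1 is labelled "end", regardless of whether the 1s are adjacent.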
Data:
> dput(df)
structure(list(year = 1977:1981, set852 = c(1L, 1L, 0L, 0L, 0L
), set857 = c(0L, 0L, 1L, 1L, 0L), set874 = c(0L, 0L, 0L, 1L,
1L), set852.1 = c(1L, 1L, 0L, 0L, 0L), set853 = c(0L, 0L, 1L,
1L, 0L), set8744 = c(0L, 0L, 0L, 1L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
looks like:
> df
year set852 set857 set874 set852.1 set853 set8744
1 1977 1 0 0 1 0 0
2 1978 1 0 0 1 0 0
3 1979 0 1 0 0 1 0
4 1980 0 1 1 0 1 1
5 1981 0 0 1 0 0 1
Upvotes: 0
Reputation: 6911
Another approach with base R:
get_states <- \(xs){
  (rle(xs))$lengths |>
    Map(f = \(len) rep('no change', len) |>
          replace(len, 'end') |>
          replace(1, 'start')
    ) |>
    Reduce(f = c)
}
df_final <- cbind(df[1],
                  df[-1] |>
                    Map(f = get_states)
)
## > df
## year set852 set857 set874
## 1 1977 1 0 0
## 2 1978 1 0 0
## 3 1979 0 1 0
## 4 1980 0 1 1
## 5 1981 0 0 1
## > df_final
## year set852 set857 set874
## 1 1977 start start start
## 2 1978 end end no change
## 3 1979 start start end
## 4 1980 no change end start
## 5 1981 end start end
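Note that the output above labels runs of 0 as well as runs of 1. If only runs of 1 should be labelled, as the Edit in the question suggests, the same rle idea can be adapted. The following is a sketch under that assumption; get_states_ones is a hypothetical helper, not part of the code above, and it assumes the input contains no NA values.

```r
# Hypothetical variant: label only runs of 1, leave runs of 0 as "no change".
get_states_ones <- function(xs) {
  runs <- rle(xs)
  labels <- Map(function(len, val) {
    out <- rep("no change", len)
    if (val == 1) {
      out[len] <- "end"    # last element of a run of 1s
      out[1] <- "start"    # first element; takes precedence for a run of length 1
    }
    out
  }, runs$lengths, runs$values)
  unlist(labels, use.names = FALSE)
}

get_states_ones(c(1, 1, 1, 0, 0))
# "start" "no change" "end" "no change" "no change"
get_states_ones(c(0, 0, 0, 1, 1))
# "no change" "no change" "no change" "start" "end"
```

It can be applied to the data frame in the same way as get_states, for example with Map(f = get_states_ones) over df[-1].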
Upvotes: 0
Reputation: 124213
Instead of a for loop, you could achieve your desired result using dplyr::across like so:
library(dplyr, warn = FALSE)
df <- data.frame(
  year = c(1977:1981),
  set852 = c(1, 1, 0, 0, 0),
  set857 = c(0, 0, 1, 1, 0),
  set874 = c(0, 0, 0, 1, 1)
)
myfun <- function(.x, year) {
  case_when(
    .x == 1 & year == 1977 ~ "start",
    .x == 1 & lag(.x) == 0 ~ "start",
    .x == 1 & lead(.x) == 0 ~ "end",
    .default = "no change"
  )
}
set_cols <- grep("\\d+$", names(df), value = TRUE)
df |>
  mutate(
    across(all_of(set_cols), ~ myfun(.x, year),
      .names = "{gsub('^.*?(\\\\d+)$', 'c\\\\1', .col)}"
    )
  ) |>
  select(-all_of(set_cols))
#> year c852 c857 c874
#> 1 1977 start no change no change
#> 2 1978 end no change no change
#> 3 1979 no change start no change
#> 4 1980 no change end start
#> 5 1981 no change no change no change
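If you would rather keep the for loop, the left-hand side of mutate can be computed with the := operator and a glue-style name, and the source column can be looked up by string with .data[[ ]]. The following is a sketch rather than a drop-in replacement: the names res, col and new_col are introduced here, .default needs dplyr 1.1.0 or later, and the year == 1977 test is replaced by the assumption that values before the first year and after the last year count as 0 (via default = 0), so a series that ends on 1 is marked "end" in its last row.

```r
library(dplyr)

df <- data.frame(
  year = 1977:1981,
  set852 = c(1, 1, 0, 0, 0),
  set857 = c(0, 0, 1, 1, 0),
  set874 = c(0, 0, 0, 1, 1)
)

set_cols <- grep("^set\\d+$", names(df), value = TRUE)
res <- df

for (col in set_cols) {
  new_col <- sub("^set", "c", col)   # e.g. "set852" -> "c852"
  res <- res |>
    mutate(
      # "{new_col}" := ... builds the new column name from the string in new_col;
      # .data[[col]] fetches the existing column whose name is stored in col.
      "{new_col}" := case_when(
        .data[[col]] == 1 & lag(.data[[col]], default = 0) == 0 ~ "start",
        .data[[col]] == 1 & lead(.data[[col]], default = 0) == 0 ~ "end",
        .default = "no change"
      )
    )
}

res |> select(-all_of(set_cols))
```

With this data the result has c874 equal to "no change", "no change", "no change", "start", "end", so the last row differs from the across output above, which leaves it as "no change" because lead(.x) is NA there.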
Upvotes: 1