Reputation: 489
I am trying to use dplyr
's mutate()
function to create new variables that depend on the previous row values of succeeding new variables.
I've searched SO with different terms to see if something clicks, and the closest that I've come is this answer. This is a rough structure of the tibble tib
I have:
library(dplyr)
library(magrittr)
tib <- tribble(
~ID,
"A1",
"A2",
"A3",
"A4",
"A5",
"A1",
"B1",
"B2",
"B3"
)
I want to use mutate()
to be able to generate the columns x
, y
and z
:
tib %<>%
mutate(
x = if_else(ID == "A1", 2, lag(y) + lag(z)),
y = if_else(ID == "A1", 3, x + lag(z)),
z = if_else(ID == "A1", 7, lag(z))
)
For instance, for the values shown above, I would want the output to be:
| ID | x | y | z |
--------------------------------
| A1 | 2 | 3 | 7 |
| A2 | 10 | 17 | 7 |
| A3 | 24 | 31 | 7 |
| A4 | 38 | 45 | 7 |
| A5 | 52 | 59 | 7 |
| A1 | 2 | 3 | 7 |
| B1 | 10 | 17 | 7 |
| B2 | 24 | 31 | 7 |
| B3 | 38 | 45 | 7 |
--------------------------------
The problem with this method is that mutate()
returns an error:
Error in lag(y) : object 'y' not found
I understand that I'm referring to y
and z
before I initiate them. If x
was something that depended only on itself as in the linked question above, I could've done it in multiple passes as suggested there, but I don't think that's possible here.
As suggested in the comments to my question, if I try to initialize those values (so that y
and z
are known), as shown below,
tib %<>%
mutate(
x = if_else(ID == "A1", 2, 0),
y = if_else(ID == "A1", 3, 0),
z = if_else(ID == "A1", 7, 0)
)
tib %<>%
mutate(
x = if_else(ID == "A1", 2, lag(y) + lag(z)),
y = if_else(ID == "A1", 3, x + lag(z)),
z = if_else(ID == "A1", 7, lag(z))
)
The tibble that I get is as follows:
# A tibble: 9 x 4
ID x y z
<chr> <dbl> <dbl> <dbl>
1 A1 2 3 7
2 A2 10 17 7
3 A3 0 0 0
4 A4 0 0 0
5 A5 0 0 0
6 A1 2 3 7
7 B1 10 17 7
8 B2 0 0 0
9 B3 0 0 0
which is different from what I expect to get (perhaps because mutate()
evaluates all variables column-wise so y
and z
are 0
)?
To hopefully put it in clearer terms, I want to be able to calculate values for each of the new columns based on the values present in the previous rows of said columns - they will always have some row where they do get initial values, but how do I make it flow to the lower rows?
If it helps, this is how I would want it to work in Excel (I'm just beginning to learn R).
I would like to continue using dplyr
to maintain consistency with my code.
Upvotes: 2
Views: 2291
Reputation: 206232
mutate()
doesn't really work with iterated formulas like Excel. It operates column-wise so having communication between each row iteration isn't easy. In this particular case, your function have simple non-recursive defintions. Here's a wrapper function that encapsulates the non-iterative versions
my_mutate <- function(data, x0, y0, z0) {
mutate(data,
n = 1:n(),
x = if_else(n==1, x0, y0 + z0*(n-1)),
y = if_else(n==1, y0, y0 + z0*2*(n-1)),
z = z0,
n = NULL
)
}
Then we can perform the within-group transformations with
tib %>% group_by(grp=cumsum(ID=="A1")) %>%
my_mutate(x0=2, y0=3, z0=7) %>%
ungroup %>% select(-grp)
# ID x y z
# <chr> <dbl> <dbl> <dbl>
# 1 A1 2 3 7
# 2 A2 10 17 7
# 3 A3 17 31 7
# 4 A4 24 45 7
# 5 A5 31 59 7
# 6 A1 2 3 7
# 7 B1 10 17 7
# 8 B2 17 31 7
# 9 B3 24 45 7
It's much easier to with with non-iterative definitions when they exist.
Upvotes: 3