Reputation: 17
Using the dplyr() lag() function I'm finding the difference in 'x' between consecutive rows. For rows that have NA in 'x', the 'diff' is NA. For the next 'diff' the NA row seems to be excluded from the calculation:
df %>%
mutate(diff = x - lag(x))
date | x | diff |
---|---|---|
2021-01-01 | 0 | 0 |
2021-01-02 | 10 | 10 |
2021-01-02 | 30 | 20 |
2021-01-03 | NA | NA |
2021-01-04 | 60 | 30 |
To have a more complete data set I am assuming that between two entries there is a linear increase in 'x' when there are NA entries.
So, I would like to divide the 'diff' following an NA row by 1 + the number of NA rows that were 'skipped'. In this example, 1 NA row was skipped so I would want to divide 30/2 and enter 15 in the diff column for the NA row and the following row, as shown here.
date | x | diff |
---|---|---|
2021-01-01 | 0 | 0 |
2021-01-02 | 10 | 10 |
2021-01-02 | 30 | 20 |
2021-01-03 | NA | 15 |
2021-01-04 | 60 | 15 |
Upvotes: 0
Views: 146
Reputation: 16978
You could use na.approx
from package zoo
:
library(dplyr)
library(zoo)
df %>%
mutate(diff = na.approx(x) - lag(na.approx(x)))
which gives you
# A tibble: 5 x 3
date x diff
<date> <dbl> <dbl>
1 2021-01-01 0 NA
2 2021-01-02 10 10
3 2021-01-02 30 20
4 2021-01-03 NA 15
5 2021-01-04 60 15
With lag(x, default = 0)
you can handle the NA
at the beginning of your data.frame.
Upvotes: 2