Reputation: 3774
I want to use zoo::na.approx (but I'm not married to this function) to fill in a response variable for the missing days in my dataframe. I'm having a tough time figuring out how to add the NAs to the original dataframe so that na.approx can fill them in.
My dataframe looks something like this:
df <- data.frame(trt = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 day = c(1, 3, 7, 9, 1, 5, 8, 9),
                 value = c(7, 12, 5, 7, 5, 6, 11, 8),
                 stringsAsFactors = FALSE)
I want every day to be in the dataframe with "NA" for each day where I don't have data.
I have used something like this to expand my dataset:
library(dplyr)
days_possible <- expand.grid(
  day = seq(from = min(df$day), to = max(df$day), by = 1),
  trt = c("A", "B"),
  stringsAsFactors = FALSE
)
new_df <- df %>%
  right_join(days_possible, by = c("trt", "day"))
My problem is that I have a bunch of sites, years, and a few treatment columns, so somewhere along the way the combinations get messed up and my days_possible dataframe comes out wrong.
Is there a function that avoids this mess by expanding one column and having all the other columns expand in a tidy fashion? I'm looking at modelr::data_grid, but by itself I am not sure how to get the final desired result: an ordered dataframe that I can group by treatment and interpolate to fill in the missing days.
Upvotes: 1
Views: 649
Reputation: 270298
We have added a value2 column to df in the Note at the end to show that this works with additional columns.
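Note that df is not a time series, whereas na.approx is intended to operate on time series. To convert it to one, read it into a zoo object with read.zoo, splitting on trt so that each treatment gets its own columns, and then merge it with a zero-width zoo series indexed by the full set of days, which adds the missing days as NA rows. na.approx can then be applied directly.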
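As a rough single-series illustration of the merge step (a sketch only; the toy series z is an assumption that reuses treatment A's values and is not part of the original answer), merging with a zero-width zoo series that carries the full index adds the missing days as NA, and na.approx then interpolates across them:

```r
library(zoo)

# Hypothetical toy series: treatment A's values, indexed by day
z <- zoo(c(7, 12, 5, 7), c(1, 3, 7, 9))

# A zero-width zoo object holding only the full index 1, 2, ..., 9;
# merging with it adds the missing days as NA
full <- merge(z, zoo(, seq(start(z), end(z), by = 1)))

# Linear interpolation across the NA gaps
filled <- na.approx(full)
filled
```

The pipeline below applies the same idea to every treatment at once.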
library(magrittr)
library(zoo)
wide <- df %>%
  read.zoo(index = "day", split = "trt") %>%
  merge(zoo(, start(.):end(.) + 0)) %>%
  na.approx
giving:
> wide
  value.A value2.A   value.B value2.B
1    7.00     1.00  5.000000 5.000000
2    9.50     1.50  5.250000 5.250000
3   12.00     2.00  5.500000 5.500000
4   10.25     2.25  5.750000 5.750000
5    8.50     2.50  6.000000 6.000000
6    6.75     2.75  7.666667 6.333333
7    5.00     3.00  9.333333 6.666667
8    6.00       NA 11.000000 7.000000
9    7.00       NA  8.000000 8.000000
The NAs above arise because linear interpolation needs an observed value on both sides of a gap; however, na.approx does have additional arguments to fill those in if you want to.
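One such argument is rule, which na.approx passes through to approx. The following is a minimal sketch on a hypothetical toy series (z is an assumption that mirrors value2.A, not the original data object): rule = 2 extends the nearest observed value outward instead of leaving NA.

```r
library(zoo)

# Hypothetical toy series mirroring value2.A: the last observation is missing
z <- zoo(c(1, NA, 2, NA, NA, NA, 3, NA, NA), 1:9)

# Interior gaps are interpolated; na.rm = FALSE keeps the trailing NAs
kept <- na.approx(z, na.rm = FALSE)

# rule = 2 is passed on to approx() and carries the nearest observed
# value outward, so days 8 and 9 take the value from day 7
extended <- na.approx(z, rule = 2, na.rm = FALSE)
extended
```

Whether constant extrapolation is appropriate depends on the data; leaving the NAs in place is often the safer default.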
The wide form shown above, with a separate column for each variable/group combination, may be the most convenient, but if not we can convert it back to long form using fortify.zoo and then spread the variables out into one column each again.
library(tidyr)
wide %>%
  fortify.zoo(melt = TRUE, sep = ".",
              names = list("day", c("variable", "group"), "value")) %>%
  spread(variable, value)
giving:
   day group     value   value2
1    1     A  7.000000 1.000000
2    1     B  5.000000 5.000000
3    2     A  9.500000 1.500000
4    2     B  5.250000 5.250000
5    3     A 12.000000 2.000000
6    3     B  5.500000 5.500000
7    4     A 10.250000 2.250000
8    4     B  5.750000 5.750000
9    5     A  8.500000 2.500000
10   5     B  6.000000 6.000000
11   6     A  6.750000 2.750000
12   6     B  7.666667 6.333333
13   7     A  5.000000 3.000000
14   7     B  9.333333 6.666667
15   8     A  6.000000       NA
16   8     B 11.000000 7.000000
17   9     A  7.000000       NA
18   9     B  8.000000 8.000000
Note: the input in reproducible form is given here. We have added a value2 column to show that the approach still works with additional columns.
df <- data.frame(trt = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 day = c(1, 3, 7, 9, 1, 5, 8, 9),
                 value = c(7, 12, 5, 7, 5, 6, 11, 8),
                 stringsAsFactors = FALSE)
df$value2 <- c(1:3, NA, 5:8)
Upvotes: 0
Reputation: 39174
We can use the complete and full_seq functions from the tidyr package. The final as.data.frame() is not required; I added it only to print the output as a data frame.
library(tidyr)
df2 <- df %>%
  complete(trt, day = full_seq(day, period = 1)) %>%
  as.data.frame()
df2
# trt day value
# 1 A 1 7
# 2 A 2 NA
# 3 A 3 12
# 4 A 4 NA
# 5 A 5 NA
# 6 A 6 NA
# 7 A 7 5
# 8 A 8 NA
# 9 A 9 7
# 10 B 1 5
# 11 B 2 NA
# 12 B 3 NA
# 13 B 4 NA
# 14 B 5 6
# 15 B 6 NA
# 16 B 7 NA
# 17 B 8 11
# 18 B 9 8
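To get from the completed data frame to the interpolated result the question asks for, one option is to group by treatment and apply na.approx within each group. This is a sketch under stated assumptions, not part of the original answer: it uses dplyr for the grouping, the name df_filled is illustrative, and na.rm = FALSE is there so that each group's vector keeps its length inside mutate.

```r
library(dplyr)
library(tidyr)
library(zoo)

df <- data.frame(trt = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 day = c(1, 3, 7, 9, 1, 5, 8, 9),
                 value = c(7, 12, 5, 7, 5, 6, 11, 8),
                 stringsAsFactors = FALSE)

df_filled <- df %>%
  # add a row for every trt/day combination, with NA where no data exist
  complete(trt, day = full_seq(day, period = 1)) %>%
  group_by(trt) %>%
  arrange(day, .by_group = TRUE) %>%
  # interpolate within each treatment, using day as the x coordinate
  mutate(value = na.approx(value, x = day, na.rm = FALSE)) %>%
  ungroup()

as.data.frame(df_filled)
```

With more identifier columns, for example site and year (hypothetical names here), wrapping them as nesting(site, year, trt) inside complete() should limit the expansion to combinations that actually occur in the data. full_seq(day, period = 1) then still spans the overall range of days, not each group's own range.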
Upvotes: 2