Reputation: 212
I have looked through various Overflow pages with similar questions (some linked) but haven't found anything that seems to help with this complicated task.
I have a series of data frames in my workspace and I would like to loop the same function (rollmean or some version of that) over all of them, then save the results to new data frames.
I have written a couple of lines to generate a list of all data frames and a for loop that should iterate an apply statement over each data frame; however, I'm having problems trying to accomplish everything I'm hoping to achieve (my code and some sample data are included below):
1) I would like to restrict the rollmean function to all columns except the 1st (or first several), so that the 'info' column(s) does not get averaged. I would also like to add this column(s) back to the output data frame.
2) I want to save the output as a new data frame (with a unique name). I do not care if it is saved to the workspace or exported as an xlsx, as I already have batch import codes written.
3) Ideally, I would like the resultant data frame to have the same number of observations as the input, whereas rollmean shrinks your data. I also do not want the missing rows to become NA, so I don't want to use fill = NA. This could be accomplished by writing a new function, passing type = "partial" in rollmean (though that still shrinks my data by 1 in my hands), or by starting the rolling mean on the nth+2 term and binding the non-averaged nth and nth+1 terms to the resulting data frame. Any way is fine.
(see picture for detail; it illustrates what the latter would look like)
My code only accomplishes parts of these things. I cannot get the for loop to work as a whole, but I can get parts to work if I run them on single data frames.
Any input is greatly appreciated because I'm out of ideas.
#reproducible data frames
a = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
b = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
c = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
colnames(a) = c("info", 1:20)
colnames(b) = c("info", 1:20)
colnames(c) = c("info", 1:20)
#identify all dataframes for looping rollmean
dflist = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)])
#for loop to create rolling average and save as new dataframe
for (j in 1:length(dflist)){
list = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)])
new.names = as.character(unique(list))
smoothed = as.data.frame(
apply(
X = names(list), MARGIN = 1, FUN = rollmean, k = 3, align = 'right'))
assign(new.names[i], smoothed)
}
I also tried a nested apply approach but couldn't get it to call the rollmean/rollapply function (similar to the issue here), so I went back to for loops. If someone can make this work with nested applies, I'm down!
Picture is ideal output: Top is single input dataframe with colored boxes demonstrating a rolling average across all columns, to be iterated over each column; bottom is ideal output with colors reflecting the location of output for each colored window above
Upvotes: 3
Views: 1871
Reputation: 269586
Below, dfnames holds the names of the data frames in env, the global environment; we have named it env in case you later want to change where they are located. Note that ls has a pattern= argument, so if the data frame names follow a distinct pattern then dfnames <- ls(pattern = whatever) could be used instead, where whatever is a suitable regular expression.
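As a minimal sketch of the pattern= idea, assuming the data frames share a hypothetical df_ prefix and live in a private environment rather than the global one (both are assumptions made only for this illustration):

```r
# Hypothetical setup: a private environment standing in for .GlobalEnv
env <- new.env()
env$df_a  <- data.frame(info = 1:3, x = c(1, 2, 3))
env$df_b  <- data.frame(info = 1:3, x = c(4, 5, 6))
env$other <- 1:10   # not a data frame and does not match the pattern

# Select by name pattern instead of testing each object with is.data.frame()
dfnames <- ls(env, pattern = "^df_")
print(dfnames)
```

This only works when the naming convention is reliable; the Filter() approach checks the object type itself and so does not depend on names.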
Now define make_new, which calls rollapplyr with a new mean function, mean3, that returns the last value of its input if the input vector has a length less than 3 and the mean otherwise. Then loop over the names, using rollapplyr with FUN=mean3 and partial=TRUE.
library(zoo)
env <- .GlobalEnv
dfnames <- Filter(function(x) is.data.frame(get(x, env)), ls(env))
# make_new - first version
mean3 <- function(x, k = 3) if (length(x) < k) tail(x, 1) else mean(x)
make_new <- function(df) replace(df, -1, rollapplyr(df[-1], 3, mean3, partial = TRUE))
for(nm in dfnames) env[[paste(nm, "new", sep = "_")]] <- make_new(get(nm, env))
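To see what mean3 combined with partial = TRUE does, it may help to try it on a single short vector first. This is only a sketch; the vector x is made up for illustration and mean3 is repeated so the snippet runs on its own.

```r
library(zoo)

# Same helper as above: short (partial) windows return their last value
mean3 <- function(x, k = 3) if (length(x) < k) tail(x, 1) else mean(x)

x <- c(1, 2, 3, 4, 5, 6, 7, 8)   # illustrative input

# partial = TRUE lets the first two windows contain 1 and 2 elements,
# so the output has the same length as the input
out <- rollapplyr(x, 3, mean3, partial = TRUE)
print(out)
# [1] 1 2 2 3 4 5 6 7
```

The first two values are passed through unchanged and the rest are right-aligned means of three values, which matches the layout asked for in the question.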
An alternative to the first version of make_new shown above is the following second version. In the second version, instead of defining mean3 we use just plain mean but specify a vector of widths w in rollapplyr such that w equals c(1, 1, 3, 3, ..., 3). Thus it takes the mean of just the last element for the first two input components and the mean of the last 3 elements for the rest. Note that now that we specify the widths explicitly we no longer need to specify partial=.
# make_new -- second version
make_new <- function(df) {
w <- replace(rep(3, nrow(df)), 1:2, 1)
replace(df, -1, rollapplyr(df[-1], w, mean))
}
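The effect of the width vector can be sketched in base R without zoo. This is not how rollapplyr is implemented; it is only an illustration, on a made-up vector x, of which elements each right-aligned window of width w[i] covers.

```r
x <- c(10, 20, 30, 40, 50)                 # illustrative input
w <- replace(rep(3, length(x)), 1:2, 1)    # widths: 1 1 3 3 3

# For position i, average the last w[i] elements ending at i
out <- sapply(seq_along(x), function(i) mean(x[(i - w[i] + 1):i]))
print(out)
# [1] 10 20 20 30 40
```

With width 1 the "mean" is just the element itself, which is why the first two values come through unchanged.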
Normally when writing R and manipulating a set of objects, one stores the objects in a list rather than leaving them loose in the global environment. We could create such a list L like this and then use lapply to create a second list L2 containing the new versions. Either version of make_new would work here.
L <- mget(dfnames, env)
L2 <- lapply(L, make_new)
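If the new data frames are still wanted as separately named objects, one possible sketch is to rename the list elements and copy them back with list2env. Here make_new_demo is a hypothetical stand-in for make_new (it just doubles the non-info columns) and env is a private environment, both assumed only so the snippet is self-contained.

```r
# Hypothetical environment and data, for illustration only
env <- new.env()
env$a <- data.frame(info = 1:4, x = c(1, 2, 3, 4))
env$b <- data.frame(info = 1:4, x = c(2, 4, 6, 8))
dfnames <- Filter(function(nm) is.data.frame(get(nm, env)), ls(env))

# Stand-in for make_new: transforms every column except the first
make_new_demo <- function(df) {
  df[-1] <- df[-1] * 2
  df
}

L  <- mget(dfnames, envir = env)          # named list of the inputs
L2 <- lapply(L, make_new_demo)            # names are carried over
names(L2) <- paste(names(L2), "new", sep = "_")
list2env(L2, envir = env)                 # creates a_new and b_new in env

print(ls(env))
```

Keeping the results in L2 is usually easier to work with; copying them back is only needed if later code expects individual objects.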
Upvotes: 1
Reputation: 160437
To approach this, think about one column, then one frame (which is just a list of columns), then a list of frames.
(My data used is at the bottom of the answer.)
If you don't like the reduction of zoo::rollmean, then write your own:
myrollmean <- function(x, k, ..., type=c("normal","rollin","keep"), na.rm=FALSE) {
type <- match.arg(type)
out <- zoo::rollmean(x, k, ...)
aug <- c()
if (type == "rollin") {
# effectively:
# c(mean(x[1]), mean(x[1:2]), ..., mean(x[1:j]))
# for the j=k-1 elements that precede the first from rollmean,
# when it'll become something like:
# c(mean(x[3:5]), mean(x[4:6]), ...)
aug <- sapply(seq_len(k-1), function(i) mean(x[seq_len(i)], na.rm=na.rm))
} else if (type == "keep") {
aug <- x[seq_len(k-1)]
}
out <- c(aug, out)
out
}
myrollmean(1:8, k=3) # "normal", default behavior
# [1] 2 3 4 5 6 7
myrollmean(1:8, k=3, type="rollin")
# [1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0
myrollmean(1:8, k=3, type="keep")
# [1] 1 2 2 3 4 5 6 7
I caution that this implementation is a bit naïve at best and needs to be hardened before real use. Make sure that you understand what it is doing when you pick a type other than "normal" (which will not work for you; I'm just defaulting to the normal zoo::rollmean behavior). This function could easily be adapted to other zoo::roll* functions.
On one column of the data:
rbind(
dflist[[1]][,2], # for comparison
myrollmean(dflist[[1]][,2], k=3, type="keep")
)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1.865352 0.4047481 0.1466527 1.7307097 0.08952618 0.6668976 1.0743669 1.511629 1.314276 0.1565303
# [2,] 1.865352 0.4047481 0.8055844 0.7607035 0.65562952 0.8290445 0.6102636 1.084298 1.300091 0.9941452
Simple use of lapply, omitting the first column:
str(dflist[[1]][1:4, 1:3])
# 'data.frame': 4 obs. of 3 variables:
# $ info: num 1 2 3 4
# $ 1 : num 1.865 0.405 0.147 1.731
# $ 2 : num 0.745 1.243 0.674 1.59
dflist[[1]][-1] <- lapply(dflist[[1]][-1], myrollmean, k=3, type="keep")
str(dflist[[1]][1:4, 1:3])
# 'data.frame': 4 obs. of 3 variables:
# $ info: num 1 2 3 4
# $ 1 : num 1.865 0.405 0.806 0.761
# $ 2 : num 0.745 1.243 0.887 1.169
(For validation, column $ 1 matches the second row in the "one column" example above.)
(I reset the data to what it was before I modified it above ... see the "data" code at the bottom of the answer.)
We nest the previous technique into another lapply:
dflist2 <- lapply(dflist, function(ldf) {
ldf[-1] <- lapply(ldf[-1], myrollmean, k=3, type="keep")
ldf
})
str(lapply(dflist2, function(a) a[1:4, 1:3]))
# List of 3
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 1.865 0.405 0.806 0.761
# ..$ 2 : num [1:4] 0.745 1.243 0.887 1.169
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 0.271 3.611 2.36 3.095
# ..$ 2 : num [1:4] 0.127 0.722 0.346 0.73
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 1.278 0.346 1.202 0.822
# ..$ 2 : num [1:4] 0.341 1.296 1.244 1.528
(Again, for simple validation, see that the first frame's $ 1 column shows the same rolled means as the second row of the "one column" example above.)
PS:
- To skip more than the first column in the lapply, use instead ldf[-(1:n)] <- lapply(ldf[-(1:n)], myrollmean, k=3, type="keep") to skip the first n columns.
- If you use a function other than zoo::rollmean, you'll want to change the special cases of myrollmean, though it should be straightforward enough given this example.
- I used str(...) to shorten the output for display here. You should verify on all of your data that it is doing what you expect for the whole of each frame.
Data:
set.seed(2)
a = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
b = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
c = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
colnames(a) = c("info", 1:20)
colnames(b) = c("info", 1:20)
colnames(c) = c("info", 1:20)
dflist <- list(a,b,c)
str(lapply(dflist, function(a) a[1:3, 1:4]))
# List of 3
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 1.865 0.405 0.147
# ..$ 2 : num [1:3] 0.745 1.243 0.674
# ..$ 3 : num [1:3] 0.356 0.689 0.833
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 0.271 3.611 3.198
# ..$ 2 : num [1:3] 0.127 0.722 0.188
# ..$ 3 : num [1:3] 1.99 2.74 4.78
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 1.278 0.346 1.981
# ..$ 2 : num [1:3] 0.341 1.296 2.094
# ..$ 3 : num [1:3] 1.1159 3.05877 0.00506
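As a sketch of the note about skipping the first n columns, the following uses a small made-up frame with two leading descriptive columns. roll_keep is a hypothetical base-R stand-in for myrollmean(..., type="keep"), written here only so the snippet runs without zoo.

```r
# Hypothetical stand-in: keep the first k-1 values, then right-aligned means of k
roll_keep <- function(x, k = 3) {
  n <- length(x)
  if (n < k) return(x)
  rolled <- sapply(k:n, function(i) mean(x[(i - k + 1):i]))
  c(x[seq_len(k - 1)], rolled)
}

# Illustrative frame: the first two columns should not be averaged
ldf <- data.frame(id  = 1:5,
                  grp = c("a", "a", "b", "b", "b"),
                  v1  = c(1, 2, 3, 4, 5),
                  v2  = c(10, 20, 30, 40, 50))

n <- 2   # number of leading columns to skip
ldf[-(1:n)] <- lapply(ldf[-(1:n)], roll_keep, k = 3)
print(ldf)
```

The id and grp columns are left untouched, and each remaining column keeps the same number of rows as the input.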
Upvotes: 3