Reputation: 1213
I have a series of text files (100 files) with same structure (same number of columns) that need to be averaged for each element in all files. What I use is this code:
my.list<-list()
for(j in 1:100){
my.list[[j]]<-read.table(paste('file_',j,'.txt',sep='')
,header=TRUE)
}
all=Reduce("+", my.list) / length(my.list)
This works quit fine when all the data frames have the same size. But when I have more rows in several data frames I get this error:
Error in Ops.data.frame(init, x[[i]]) : ‘+’ only defined for equally-sized data frames
I would like to know if there is any solution to make the average of existing number for the rows that are not presented in all the data frames.
Here is example files with 1 column (I just put two files but in my data is 100 files):
file1:
x
1 3
2 5
3 1
4 8
5 2
6 6
7 0
file2:
x
1 4
2 2
3 6
4 1
5 9
6 2
and I want an output:
x
1 3.5
2 3.5
3 3.5
4 4.5
5 5.5
6 4
7 0
thanks in advance
Upvotes: 1
Views: 1232
Reputation: 4921
The following solution uses the zoo package
install.packages("zoo")
library(zoo)
Two unequal vectors
file1<-c(3,5,1,8,2,6,0)
file2<-c(4,2,6,1,9,2)
z1<-zoo(file1)
z2<-zoo(file2)
dat<-cbind(z1,z2)
rowMeans(dat, na.rm=TRUE)
For more than 2 vectors:
file3<-c(2,3)
z3<-zoo(file3)
dat<-cbind(z1,z2,z3)
rowMeans(dat, na.rm=TRUE)
Unequal matrices
z1<-zoo(cbind(c(1,2,3),c(0,2,7)))
z2<-zoo(cbind(c(0,3,4,7,2),c(1,4,2,3,8)))
dat<-cbind(z1,z2)
rowMeans(dat, na.rm=TRUE)
Note: in the provided examples the missing elements are filled up with NA's. If you use cbind(z1,z2,fill=0) the NA's are filled up with zero values which produces different row mean values.
Upvotes: 3
Reputation: 887891
We could use rbindlist
from data.table
to "rbind" the list elements (list(x,y)
). The id.col=TRUE
generates the id
for each list elements. Create a sequence index (.idx
) grouped by ".id", use the ".idx" as grouping variable and get the means of subset of datatable (.SD
) by using lapply
. The .SDcols
specifies the columns to select for "mean" calculation.
library(data.table) # data.table_1.9.5
rbindlist(list(x,y), idcol=TRUE)[,.idx:=1:.N , by = .id][,
lapply(.SD, mean), .SDcols=2:3, by= .idx]
# .idx Col1 Col2
#1: 1 1 3.5
#2: 2 2 3.5
#3: 3 3 3.5
#4: 4 4 4.5
#5: 5 5 5.5
#6: 6 6 4.0
#7: 7 7 0.0
Or we could use another option suggested by @David Arenburg. The steps are the same as above except using colMeans
instead of lapply(.SD,
.
rbindlist(list(x,y), idcol=TRUE)[, .idx:=1:.N, by =.id][,
as.list(colMeans(.SD)), .SDcols=2:3, by = .idx]
Suppose if the datasets have variable ncols
from multiple datasets, we could read all the files from the working directory using fread
, use rbindlist
with fill=TRUE
, and use the code as above with appropriate changes in the .SDcols
files <- list.files()
lst <- lapply(files, fread)
rbindlist(lst, idcol=TRUE, fill=TRUE)[, .idx:=1:.N, by =.id][,
as.list(colMeans(.SD, na.rm=TRUE)), .SDcols=2:3, by = .idx]
Using an example
lst <- list(x,y,z)
rbindlist(lst, idcol=TRUE, fill=TRUE)[,
.idx:=1:.N, by=.id][, as.list(colMeans(.SD, na.rm=TRUE)),
.SDcols=2:4, by=.idx][,-1, with=FALSE]
# Col1 Col2 Col3
#1: 1 3.000000 3
#2: 2 3.333333 4
#3: 3 3.666667 5
#4: 4 4.666667 6
#5: 5 5.666667 7
#6: 6 4.000000 NaN
#7: 7 0.000000 NaN
A base R
option would be to create a list ("lst1") with list elements have equal dimensions. This can be done by first creating a NA
dataset "d1" with dimensions equal to "maximum" nrow/ncol from checking the dimensions of "lst" (Dim1 <- sapply(..)
). Use Map
to fill the elements of "d1" that match with the "rownames/colnames" of each "lst" element. The output "lst1" can be converted to array (ar1
), use apply
with margins (c(1,2)
) to get the mean
value.
Dim1 <- sapply(lst, dim)
d1 <- as.data.frame(matrix(NA, ncol=max(Dim1[2,]),
nrow=max(Dim1[1,])))
nm1 <- unique(unlist(sapply(lst, colnames)))
names(d1) <- nm1
lst1 <-Map(function(x,y) {
y[match(row.names(x), row.names(y)),
match(colnames(x), colnames(y))] <- x
y },
lst, list(d1))
ar1 <- array(unlist(lst1), dim=c(max(Dim1[1,]), max(Dim1[2,]), length(lst1)))
apply(ar1, c(1,2), mean, na.rm=TRUE)
# [,1] [,2] [,3]
#[1,] 1 3.000000 3
#[2,] 2 3.333333 4
#[3,] 3 3.666667 5
#[4,] 4 4.666667 6
#[5,] 5 5.666667 7
#[6,] 6 4.000000 NaN
#[7,] 7 0.000000 NaN
x <- structure(list(Col1 = 1:7, Col2 = c(3L, 5L, 1L, 8L, 2L, 6L, 0L
)), .Names = c("Col1", "Col2"), class = "data.frame", row.names =
c(NA, -7L))
y <- structure(list(Col1 = 1:6, Col2 = c(4L, 2L, 6L, 1L, 9L, 2L)),
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA,
-6L))
z <- data.frame(Col1=1:5, Col2=2:6, Col3=3:7)
Upvotes: 2