mahmood
mahmood

Reputation: 1213

How to average elements in several data frames with unequal size?

I have a series of text files (100 files) with same structure (same number of columns) that need to be averaged for each element in all files. What I use is this code:

my.list<-list()
for(j in 1:100){
  my.list[[j]]<-read.table(paste('file_',j,'.txt',sep='')
                               ,header=TRUE)
}

all=Reduce("+", my.list) / length(my.list)

This works quit fine when all the data frames have the same size. But when I have more rows in several data frames I get this error:

Error in Ops.data.frame(init, x[[i]]) : ‘+’ only defined for equally-sized data frames

I would like to know if there is any solution to make the average of existing number for the rows that are not presented in all the data frames.

Here is example files with 1 column (I just put two files but in my data is 100 files):

file1:

             x
1 3
2 5
3 1
4 8
5 2
6 6
7 0

file2:

           x
1 4
2 2
3 6
4 1
5 9
6 2

and I want an output:

          x
1 3.5
2 3.5
3 3.5
4 4.5
5 5.5
6 4
7 0

thanks in advance

Upvotes: 1

Views: 1232

Answers (2)

Ruthger Righart
Ruthger Righart

Reputation: 4921

The following solution uses the zoo package

install.packages("zoo")
library(zoo)

Two unequal vectors

file1<-c(3,5,1,8,2,6,0)
file2<-c(4,2,6,1,9,2)

z1<-zoo(file1)
z2<-zoo(file2)

dat<-cbind(z1,z2)

rowMeans(dat, na.rm=TRUE)

For more than 2 vectors:

file3<-c(2,3)
z3<-zoo(file3)
dat<-cbind(z1,z2,z3)

rowMeans(dat, na.rm=TRUE)

Unequal matrices

z1<-zoo(cbind(c(1,2,3),c(0,2,7)))
z2<-zoo(cbind(c(0,3,4,7,2),c(1,4,2,3,8)))

dat<-cbind(z1,z2)
rowMeans(dat, na.rm=TRUE)

Note: in the provided examples the missing elements are filled up with NA's. If you use cbind(z1,z2,fill=0) the NA's are filled up with zero values which produces different row mean values.

Upvotes: 3

akrun
akrun

Reputation: 887891

We could use rbindlist from data.table to "rbind" the list elements (list(x,y)). The id.col=TRUE generates the id for each list elements. Create a sequence index (.idx) grouped by ".id", use the ".idx" as grouping variable and get the means of subset of datatable (.SD) by using lapply. The .SDcols specifies the columns to select for "mean" calculation.

library(data.table) # data.table_1.9.5 
rbindlist(list(x,y), idcol=TRUE)[,.idx:=1:.N , by = .id][, 
                  lapply(.SD, mean), .SDcols=2:3, by= .idx]
#    .idx Col1 Col2
#1:    1    1  3.5
#2:    2    2  3.5
#3:    3    3  3.5
#4:    4    4  4.5
#5:    5    5  5.5
#6:    6    6  4.0
#7:    7    7  0.0

Or we could use another option suggested by @David Arenburg. The steps are the same as above except using colMeans instead of lapply(.SD,.

rbindlist(list(x,y), idcol=TRUE)[, .idx:=1:.N, by =.id][,
              as.list(colMeans(.SD)), .SDcols=2:3, by = .idx]

Update

Suppose if the datasets have variable ncols from multiple datasets, we could read all the files from the working directory using fread, use rbindlist with fill=TRUE, and use the code as above with appropriate changes in the .SDcols

  files <- list.files() 
  lst <- lapply(files, fread)
  rbindlist(lst, idcol=TRUE, fill=TRUE)[, .idx:=1:.N, by =.id][,
              as.list(colMeans(.SD, na.rm=TRUE)), .SDcols=2:3, by = .idx]

Using an example

  lst <- list(x,y,z)
  rbindlist(lst, idcol=TRUE, fill=TRUE)[, 
     .idx:=1:.N, by=.id][, as.list(colMeans(.SD, na.rm=TRUE)),
               .SDcols=2:4, by=.idx][,-1, with=FALSE]
  #    Col1     Col2 Col3
  #1:    1 3.000000    3
  #2:    2 3.333333    4
  #3:    3 3.666667    5
  #4:    4 4.666667    6
  #5:    5 5.666667    7
  #6:    6 4.000000  NaN
  #7:    7 0.000000  NaN

Update2

A base R option would be to create a list ("lst1") with list elements have equal dimensions. This can be done by first creating a NA dataset "d1" with dimensions equal to "maximum" nrow/ncol from checking the dimensions of "lst" (Dim1 <- sapply(..)). Use Map to fill the elements of "d1" that match with the "rownames/colnames" of each "lst" element. The output "lst1" can be converted to array (ar1), use apply with margins (c(1,2)) to get the mean value.

   Dim1 <- sapply(lst, dim)
   d1 <- as.data.frame(matrix(NA, ncol=max(Dim1[2,]), 
                              nrow=max(Dim1[1,])))
   nm1 <- unique(unlist(sapply(lst, colnames)))
   names(d1) <- nm1
   lst1 <-Map(function(x,y) {
      y[match(row.names(x), row.names(y)),
                  match(colnames(x), colnames(y))] <- x
       y }, 
       lst, list(d1))

 ar1 <- array(unlist(lst1), dim=c(max(Dim1[1,]), max(Dim1[2,]), length(lst1)))
 apply(ar1, c(1,2), mean, na.rm=TRUE)
 #      [,1]     [,2] [,3]
 #[1,]    1 3.000000    3
 #[2,]    2 3.333333    4
 #[3,]    3 3.666667    5
 #[4,]    4 4.666667    6
 #[5,]    5 5.666667    7
 #[6,]    6 4.000000  NaN
 #[7,]    7 0.000000  NaN

data

x <- structure(list(Col1 = 1:7, Col2 = c(3L, 5L, 1L, 8L, 2L, 6L, 0L
)), .Names = c("Col1", "Col2"), class = "data.frame", row.names = 
c(NA, -7L))

y <- structure(list(Col1 = 1:6, Col2 = c(4L, 2L, 6L, 1L, 9L, 2L)), 
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA, 
-6L))

z <- data.frame(Col1=1:5, Col2=2:6, Col3=3:7)

Upvotes: 2

Related Questions