Aquarius
Aquarius

Reputation: 262

How to calculate monthly average of multiple columns in dataframe based on Date column with missind data in R

I have large number of columns in my dataframe over 4000. One column is of Date and rest are Companies (Column Names). I have daily obervations for 14 years (as rows) making it be 164 months .I want to calculate mothly averages based on the Date column and most important of all the average should only be calculated when there are atleast 15 observations for each column(company) otherwise should return NA.

df<- Spread
Date             A             B    C
2000-01-04  0.062893082 0.030769231 NA
2000-01-05  0.062893082 0.015503876 NA
2000-01-06  0.062893082 NA          NA
2000-01-07  0.062893082 NA          NA
2000-01-10  0.062893082 NA          NA
2000-01-11  0.062893082 NA          NA
2000-01-12  0.062893082 NA          NA
2000-01-13  0.062893082 NA          NA
2000-01-14  0.062893082 NA          NA
2000-01-17  0.052910053 NA          NA
2000-01-18  0.031413613 NA          NA
2000-01-19  0.052910053 NA          NA
2000-01-20  0.051282051 NA          NA
2000-01-21  0.051282051 0.014184397 NA
2000-01-24  0.051282051 0.014184397 NA
2000-01-25  0.051282051 0.014184397 NA
2000-01-26  0.051282051 0.014184397 NA
2000-01-27  0.051282051 0.019914651 NA
2000-01-28  0.031088083 0.028571429 NA
2000-01-31  0.031088083 0.028571429 NA

The output I want

Monthly<- df
Month          A        B   C
Jan-2000    0.053656996 NA  NA

I would really apprecaite your help. And any ideas I want round off these values to 4 decimal places for instance 0.062893082 to 0.0628.

Upvotes: 2

Views: 1951

Answers (1)

akrun
akrun

Reputation: 887951

We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), then we use format to extract the month-year (after converting to Date class). This can be used as a grouping variable. We loop through the columns (lapply(.SD,...) and if the length of non-NA elements are greater than or equal to 15 get the mean or else return as NA.

library(data.table)
setDT(df1)[,lapply(.SD, function(x) if(length(na.omit(x)) >=15)
       mean(x, na.rm=TRUE) else NA_real_) ,
             by = .(Month= format(as.IDate(Date), '%b-%Y'))]
#      Month        A  B  C
#1: Jan-2000 0.053657 NA NA

A similar approach using dplyr would be

library(dplyr)
df1 %>% 
    group_by(Month = format(as.Date(Date), '%b-%Y')) %>%
    summarise_each(funs( if(length(na.omit(.))>=15) 
                       mean(., na.rm=TRUE) else NA_real_), A:C)
#    Month        A     B     C
#     (chr)    (dbl) (dbl) (dbl)
#1 Jan-2000 0.053657    NA    NA

data

df1 <- structure(list(Date = c("2000-01-04", "2000-01-05", "2000-01-06", 
"2000-01-07", "2000-01-10", "2000-01-11", "2000-01-12", "2000-01-13", 
"2000-01-14", "2000-01-17", "2000-01-18", "2000-01-19", "2000-01-20", 
"2000-01-21", "2000-01-24", "2000-01-25", "2000-01-26", "2000-01-27", 
"2000-01-28", "2000-01-31"), A = c(0.062893082, 0.062893082, 
0.062893082, 0.062893082, 0.062893082, 0.062893082, 0.062893082, 
0.062893082, 0.062893082, 0.052910053, 0.031413613, 0.052910053, 
0.051282051, 0.051282051, 0.051282051, 0.051282051, 0.051282051, 
0.051282051, 0.031088083, 0.031088083), B = c(0.030769231, 0.015503876, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.014184397, 0.014184397, 
0.014184397, 0.014184397, 0.019914651, 0.028571429, 0.028571429
), C = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA)), .Names = c("Date", "A", "B", "C"
), class = "data.frame", row.names = c(NA, -20L))

Upvotes: 3

Related Questions