Reputation: 262
I have large number of columns in my dataframe over 4000. One column is of Date and rest are Companies (Column Names). I have daily obervations for 14 years (as rows) making it be 164 months .I want to calculate mothly averages based on the Date column and most important of all the average should only be calculated when there are atleast 15 observations for each column(company) otherwise should return NA.
df<- Spread
Date A B C
2000-01-04 0.062893082 0.030769231 NA
2000-01-05 0.062893082 0.015503876 NA
2000-01-06 0.062893082 NA NA
2000-01-07 0.062893082 NA NA
2000-01-10 0.062893082 NA NA
2000-01-11 0.062893082 NA NA
2000-01-12 0.062893082 NA NA
2000-01-13 0.062893082 NA NA
2000-01-14 0.062893082 NA NA
2000-01-17 0.052910053 NA NA
2000-01-18 0.031413613 NA NA
2000-01-19 0.052910053 NA NA
2000-01-20 0.051282051 NA NA
2000-01-21 0.051282051 0.014184397 NA
2000-01-24 0.051282051 0.014184397 NA
2000-01-25 0.051282051 0.014184397 NA
2000-01-26 0.051282051 0.014184397 NA
2000-01-27 0.051282051 0.019914651 NA
2000-01-28 0.031088083 0.028571429 NA
2000-01-31 0.031088083 0.028571429 NA
The output I want
Monthly<- df
Month A B C
Jan-2000 0.053656996 NA NA
I would really apprecaite your help. And any ideas I want round off these values to 4 decimal places for instance 0.062893082 to 0.0628.
Upvotes: 2
Views: 1951
Reputation: 887951
We can use data.table
. We convert the 'data.frame' to 'data.table' (setDT(df1)
), then we use format
to extract the month-year (after converting to Date
class). This can be used as a grouping variable. We loop through the columns (lapply(.SD,...
) and if
the length
of non-NA elements are greater than or equal to 15 get the mean
or else
return as NA.
library(data.table)
setDT(df1)[,lapply(.SD, function(x) if(length(na.omit(x)) >=15)
mean(x, na.rm=TRUE) else NA_real_) ,
by = .(Month= format(as.IDate(Date), '%b-%Y'))]
# Month A B C
#1: Jan-2000 0.053657 NA NA
A similar approach using dplyr
would be
library(dplyr)
df1 %>%
group_by(Month = format(as.Date(Date), '%b-%Y')) %>%
summarise_each(funs( if(length(na.omit(.))>=15)
mean(., na.rm=TRUE) else NA_real_), A:C)
# Month A B C
# (chr) (dbl) (dbl) (dbl)
#1 Jan-2000 0.053657 NA NA
df1 <- structure(list(Date = c("2000-01-04", "2000-01-05", "2000-01-06",
"2000-01-07", "2000-01-10", "2000-01-11", "2000-01-12", "2000-01-13",
"2000-01-14", "2000-01-17", "2000-01-18", "2000-01-19", "2000-01-20",
"2000-01-21", "2000-01-24", "2000-01-25", "2000-01-26", "2000-01-27",
"2000-01-28", "2000-01-31"), A = c(0.062893082, 0.062893082,
0.062893082, 0.062893082, 0.062893082, 0.062893082, 0.062893082,
0.062893082, 0.062893082, 0.052910053, 0.031413613, 0.052910053,
0.051282051, 0.051282051, 0.051282051, 0.051282051, 0.051282051,
0.051282051, 0.031088083, 0.031088083), B = c(0.030769231, 0.015503876,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.014184397, 0.014184397,
0.014184397, 0.014184397, 0.019914651, 0.028571429, 0.028571429
), C = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA)), .Names = c("Date", "A", "B", "C"
), class = "data.frame", row.names = c(NA, -20L))
Upvotes: 3