Jakob
Jakob

Reputation: 1453

Add mean columns to dataframe by category with NA's

I'm trying to add a column of the means to my data-frame with some code I can run over many different columns even if they include NA's (which I think excludes many possibilities).

The best I could do is:

TestData <- data.frame(geo=c(rep("AT",4),rep("DE",4)),time=c(rep(c(1990:1993),2)),value=c(NA,4,20,6,NA,NA,5,3))

mean <- aggregate(value~geo, TestData, mean)

which calculates the correct means by category (geo). How can I rightjoin them to the dataframe so that the mean is not just one observation but shows up at every time point? I'm thinking about ddply but can't get it to work properly. The data frame I'm looking for is:

    geo time    value   mean   (or optionally, no problem for me)
1   AT  1990    NA     10      NA
2   AT  1991    4      10      10
3   AT  1992    20     10      10
4   AT  1993    6      10      10
5   DE  1990    NA     4       NA
6   DE  1991    NA     4       NA
7   DE  1992    5      4       4
8   DE  1993    3      4       4

Any help would be appreciated!

Upvotes: 2

Views: 157

Answers (1)

akrun
akrun

Reputation: 887148

Try:

 testData1 <-  within(TestData, {
                    Mean <- ave(value, geo, FUN=function(x) mean(x, na.rm=TRUE))
                    Mean[is.na(value)] <- NA}) #If you don't want `NA` values don't use this step


  testData1
  # geo time value Mean
 #1  AT 1990    NA   NA
 #2  AT 1991     4   10
 #3  AT 1992    20   10
 #4  AT 1993     6   10
 #5  DE 1990    NA   NA
 #6  DE 1991    NA   NA
 #7  DE 1992     5    4
 #8  DE 1993     3    4

If you want to find mean of multiple columns that starts with name value:

For example:

 TestData1 <- TestData
 TestData1$value2 <- c(4, NA, 25, NA, NA, 10,5, 2)


 library(dplyr)

 res <- left_join(TestData1,
              TestData1 %>% 
                        group_by(geo) %>%
                        mutate_each(funs(mean=mean(., na.rm=TRUE)), starts_with("value")), 
                          by=c("geo", "time"))


 colnames(res) <- gsub("\\.y$", ".mean", colnames(res))
 res
 #  geo time value.x value2.x value.mean value2.mean
 #1  AT 1990      NA        4         10   14.500000
 #2  AT 1991       4       NA         10   14.500000
 #3  AT 1992      20       25         10   14.500000
 #4  AT 1993       6       NA         10   14.500000
 #5  DE 1990      NA       NA          4    5.666667
 #6  DE 1991      NA       10          4    5.666667
 #7  DE 1992       5        5          4    5.666667
 #8  DE 1993       3        2          4    5.666667

Upvotes: 2

Related Questions