Yamuna_dhungana

Reputation: 663

How to omit NA in aggregate to calculate SD in R

I have a dataframe that looks like this:

dat <- structure(list(cohort = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ADC8_AA", class = "factor"), 
    status = c(1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 
    1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, -9L, 1L, 1L, 2L, 
    2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 
    2L, 2L, 1L, 2L, -9L, 2L, 1L, -9L, 2L), age_onset = c(NA, 
    NA, NA, NA, 63, NA, 79, NA, 67, 71, 81, NA, NA, NA, NA, 73, 
    NA, 66, 77, 68, 75, NA, NA, NA, NA, 76, 79, NA, NA, NA, NA, 
    NA, 70, NA, 77, 84, 78, 76, NA, 92, 64, 60, 72, NA, 81, NA, 
    62, NA, 82, 74)), row.names = c(NA, 50L), class = "data.frame")

I am trying to get the mean and SD like this, but the SD comes out as NA for status == -9. What could be the reason, and how do I do this correctly?

> aggregate(age_onset~cohort+status, data = dat, mean, na.action = na.omit)
   cohort status age_onset
1 ADC8_AA     -9  82.00000
2 ADC8_AA      2  73.54167
> aggregate(age_onset~cohort+status, data = dat, sd)
   cohort status age_onset
1 ADC8_AA     -9        NA
2 ADC8_AA      2  7.661191

Upvotes: 1

Views: 506

Answers (2)

akrun

Reputation: 887541

We can use dplyr, passing na.rm = TRUE to both mean and sd:

library(dplyr)
dat %>% 
    group_by(cohort, status) %>%
    summarise(Mean = mean(age_onset, na.rm = TRUE), 
              SD = sd(age_onset, na.rm = TRUE))
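It can also help to report how many non-missing values each group has, since that shows at a glance which groups cannot have an SD. A minimal sketch, assuming dplyr 1.0.0 or later (for the .groups argument); `toy` is a small illustrative stand-in for `dat` with the same column names, and the column name `n` is just a choice made here:

```r
library(dplyr)

# Toy stand-in for `dat` (illustrative values, same column names)
toy <- data.frame(
  cohort    = "ADC8_AA",
  status    = c(1, 1, 2, 2, 2, -9, -9, -9),
  age_onset = c(NA, NA, 63, 79, 67, NA, NA, 82)
)

res <- toy %>%
  group_by(cohort, status) %>%
  summarise(
    n    = sum(!is.na(age_onset)),          # non-missing values per group
    Mean = mean(age_onset, na.rm = TRUE),   # NaN when n == 0
    SD   = sd(age_onset, na.rm = TRUE),     # NA when n < 2
    .groups = "drop"                        # return an ungrouped tibble
  )
res
```

Unlike the formula interface of aggregate, this keeps groups whose values are all missing (status 1 here), so they show up with n = 0 rather than disappearing.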

Upvotes: 0

Gregor Thomas

Reputation: 146010

Try this:

aggregate(age_onset~cohort+status, data = dat, sd, na.rm = TRUE)
#    cohort status age_onset
# 1 ADC8_AA     -9        NA
# 2 ADC8_AA      2  7.661191

You can use the ... argument of aggregate to pass na.rm = TRUE through to sd.

You will still get NA for any group that has only a single non-missing value. This is because the sample standard deviation, which divides by n - 1, isn't defined for a single value.

subset(dat, status == -9)
#     cohort status age_onset
# 23 ADC8_AA     -9        NA
# 46 ADC8_AA     -9        NA
# 49 ADC8_AA     -9        82

sd(82)
# [1] NA
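As a side note on why na.rm = TRUE leaves the output unchanged: the formula interface of aggregate defaults to na.action = na.omit, so rows with a missing age_onset are dropped before grouping (which is also why status 1 never appears). A rough sketch of the difference, using a small toy data frame as an illustrative stand-in for `dat`:

```r
# Toy stand-in for `dat` (illustrative values, same column names)
toy <- data.frame(
  cohort    = "ADC8_AA",
  status    = c(1, 1, 2, 2, 2, -9, -9, -9),
  age_onset = c(NA, NA, 63, 79, 67, NA, NA, 82)
)

# Default na.action = na.omit: rows with NA are removed before grouping,
# so na.rm = TRUE changes nothing and the all-NA group (status 1) is absent
a <- aggregate(age_onset ~ cohort + status, data = toy, FUN = sd)
b <- aggregate(age_onset ~ cohort + status, data = toy, FUN = sd, na.rm = TRUE)
all.equal(a, b)

# na.action = na.pass keeps every row; na.rm = TRUE then drops the NAs
# within each group, and all three status values are reported
full <- aggregate(age_onset ~ cohort + status, data = toy, FUN = sd,
                  na.rm = TRUE, na.action = na.pass)
full
```

With na.pass, groups with fewer than two non-missing values still get NA for the SD, but they are at least listed in the result.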

Upvotes: 3
