B. Davis
B. Davis

Reputation: 3441

summarize data with NAs using ddply function

I am trying to summarize data with NA values and am using the ddply function.

For example, using the data included below,

set.seed(123)
dat <- data.frame(IndID = rep(c("AAA", "BBB", "CCC"), 100),
        ValOne = sample(c(1, 0, NA), replace = T, 300),
        ValTwo = sample(c(1,NA), replace = T, 300),
        VarThree = sample(c("Thanks", "alot"), replace = T, 300))

> head(dat)
  IndID ValOne ValTwo
1   AAA      1     NA
2   BBB     NA      1
3   CCC      0     NA
4   AAA     NA     NA
5   BBB     NA     NA
6   CCC      1      1

I want to calculate the number of times that each individual has a 1 in the ValOne and ValTwo column I am using the code below to create a new data.frame and summarize the data by IndID and use both length and sum functions.

library(plyr)
tbl <- ddply(dat, "IndID", summarise, 
    ColOne = length(dat$ValOne[dat$ColOne == 1]),
    NumHighHDOP = sum(dat$ValTwo[dat$ValTwo == 1], na.rm = T))

As seen below,

 IndID ColOne NumHighHDOP
1   AAA      0         155
2   BBB      0         155
3   CCC      0         155

the resulting table summarizes the data for the entire data.frame and not for each individual.

Both approaches (length and sum) are struggling with the NAs in the data.frame. Any suggestions would be appreciated.

EDIT With the new data set including a factor. Is it also possible to calculate the number of "Thanks" for each individual?

Upvotes: 2

Views: 4738

Answers (1)

akrun
akrun

Reputation: 887068

We can use dplyr. We group by 'IndID', and get the count of 1 for each column with summarise_each. To remove the NA elements, either na.omit or use a logical condition to output TRUE only for 1.

library(dplyr)
dat %>%
   group_by(IndID) %>% 
   summarise_each(funs(sum(.==1 & !is.na(.))))
   #or
   #summarise_each(funs(sum(na.omit(.)==1)))

Update

Based on the updated dataset in the OP's post, if we want to count the 'Thanks' in the third column, we can use %in% (assuming that 'Thanks' is not present in the other two columns and 1 not in the last column).

dat %>% 
   group_by(IndID) %>%
   summarise_each(funs(sum(na.omit(.) %in% c(1, 'Thanks'))))

Upvotes: 3

Related Questions