user2006697

Reputation: 1117

Count Complete Cases per Group

I have a big data set (roughly 10,000 rows) and want to write a function that counts the number of complete cases (non-NA values) per group. I have tried various functions (aggregate, table, sum(complete.cases), group_by, etc.), but I am missing some trick, probably a small one. Thanks for any help!

Here is a small sample data set to illustrate the result I need.

x <- data.frame(group = c(1:4), 
                age = c(4:1, c(11, NA,13, NA)), 
                speed = c(12, NA,15,NA))
print(x)
#  group age speed
#1     1   4    12
#2     2   3    NA
#3     3   2    15
#4     4   1    NA
#5     1  11    12
#6     2  NA    NA
#7     3  13    15
#8     4  NA    NA

One function I wrote reads as follows:

CountPerGroup <- function(group) {
    data.set <- subset(x,group %in% group)

    vect <- vector()
    for (i in 1:length(group)) {
        vect[i] <- sum(complete.cases(data.set))
    }
    output <- data.frame(cbind(group,count=vect))   
    return(output)

}

The result of

CountPerGroup(2:1)

is

  group count
1     2     4
2     1     4

Unfortunately, this is wrong. Instead the outcome should look like

  group count
1     2     1
2     1     4

What am I missing? How can I tell R to count the complete cases per group? Thank you very much for any help on this!

Upvotes: 3

Views: 4227

Answers (4)

Adam Waring

Reputation: 1268

I just had the same problem and found an easier solution using data.table, which counts the rows per group that have no NA in any non-grouping column:

library(data.table)

x <- data.table(group = c(1:4), 
                age = c(4:1, c(11, NA,13, NA)), 
                speed = c(12, NA,15,NA))
x[,sum(complete.cases(.SD)), by=group]

Upvotes: 0

Anders Ellern Bilgrau

Reputation: 10253

Something like this should do the trick if you wish to keep your function's interface:

x <- data.frame(group = c(1:4), 
                age = c(4:1, c(11, NA,13, NA)), 
                speed = c(12, NA,15,NA))

CountPerGroup <- function(x, groups) {
  data.set <- subset(x, group %in% groups)
  ans <- sapply(split(data.set, data.set$group), 
                function(y) sum(complete.cases(y)))
  return(data.frame(group = names(ans), count = unname(ans)))
}


CountPerGroup(x, 1:2)
#  group count
#1     1     2
#2     2     0

This is correct as far as I can tell (only group 1 has rows with no NA in them), but it does not agree with your suggested outcome.
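The mismatch seems to come from two different things being counted: rows with no NA at all (what complete.cases reports) versus individual non-NA values. The following is only a base R sketch on the question's sample data to show both counts side by side; the names vals, complete_rows and non_na_cells are made up for this illustration.

```r
# Sample data from the question (group and speed are recycled to 8 rows)
x <- data.frame(group = 1:4,
                age   = c(4:1, 11, NA, 13, NA),
                speed = c(12, NA, 15, NA))

# Value columns only; the grouping column is left out of the NA checks
vals <- x[, c("age", "speed")]

# Definition 1: rows per group with no NA in any value column
complete_rows <- tapply(complete.cases(vals), x$group, sum)

# Definition 2: non-NA cells per group, summed over all value columns
non_na_cells <- tapply(rowSums(!is.na(vals)), x$group, sum)

print(data.frame(group         = names(complete_rows),
                 complete_rows = as.vector(complete_rows),
                 non_na_cells  = as.vector(non_na_cells)))
```

For group 2, the first definition gives 0 and the second gives 1, which is the value the question expects.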

EDIT

It seems that you want the number of non-NA values instead, returned in the order in which the groups were requested. Use this function instead:

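CountPerGroup2 <- function(x, groups) {
   data.set <- subset(x, group %in% groups)
   # Count non-NA values in every column except the grouping column
   ans <- sapply(split(data.set, data.set$group), 
                 function(y) sum(!is.na(y[, !grepl("group", names(y))])))
   # Index by name (not position) so any subset of groups, e.g. 3:4, works
   ans <- ans[as.character(groups)]
   return(data.frame(group = names(ans), count = unname(ans)))
}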

CountPerGroup2(x, 2:1)
#  group count
#1     2     1
#2     1     4

Upvotes: 3

Zachary Cross

Reputation: 2318

If you are just looking for a way to get the full count of non-NA values per group, you could use something like:

library(plyr)
x <- data.frame(group = c(1:4), 
                age = c(4:1, c(11, NA,13, NA)), 
                speed = c(12, NA,15,NA))

counts <- ddply(x, "group", summarize, count=sum(!is.na(c(age, speed))))

##   group count
## 1     1     4
## 2     2     1
## 3     3     4
## 4     4     1

You do miss out on having a function that lets you query a subset of the groups, but you get a one-line way to calculate the full solution.
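If the subset query is still wanted, one possible way to get it back is to compute the per-group totals and then index them by the requested groups. This is a base R sketch rather than a plyr one, and count_non_na, group_col and the other names are hypothetical helpers introduced only for this example.

```r
x <- data.frame(group = 1:4,
                age   = c(4:1, 11, NA, 13, NA),
                speed = c(12, NA, 15, NA))

# Hypothetical helper (not part of the answer above): count non-NA values
# per group, restricted to `groups` and returned in the order requested.
count_non_na <- function(data, groups, group_col = "group") {
  # Every column except the grouping column is treated as a value column
  value_cols <- setdiff(names(data), group_col)
  # Non-NA values in each row, then summed within each group
  per_row <- rowSums(!is.na(data[, value_cols, drop = FALSE]))
  totals  <- tapply(per_row, data[[group_col]], sum)
  # Index by name so the output follows the requested order;
  # a group that is absent from the data yields NA
  picked <- totals[as.character(groups)]
  data.frame(group = groups, count = as.vector(picked))
}

res <- count_non_na(x, 2:1)
print(res)
```

With the sample data, count_non_na(x, 2:1) returns group 2 first with a count of 1, then group 1 with a count of 4, matching the outcome requested in the question.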

Upvotes: 2

Colonel Beauvel

Reputation: 31181

Here is a way with data.table (x is the data frame from the question):

library(data.table)
library(functional)

countPerGroup = function(x, vec)
{
    dt = data.table(x) 
    d1 = setkey(dt, group)[group %in% vec]
    d2 = d1[,lapply(.SD, Compose(Negate(is.na), sum)),by=group]
    transform(d2, count=age+speed, speed=NULL, age=NULL)
}


countPerGroup(x, 1:2)
#   group count
#1:     1     4
#2:     2     1

countPerGroup(x, c(1,2))
#   group count
#1:     1     4
#2:     2     1

This is particularly efficient if your data.table has a large number of rows!

Upvotes: 0
