Reputation: 1117
I have a large data set (roughly 10,000 rows) and want to write a function that counts the number of complete cases (not NAs) per group. I tried various functions (aggregate, table, sum(complete.cases), group_by, etc.), but I am missing some trick, probably a small one. Thanks for any help!
Here is a small sample data set to illustrate the result I need.
x <- data.frame(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
print(x)
#  group age speed
#1     1   4    12
#2     2   3    NA
#3     3   2    15
#4     4   1    NA
#5     1  11    12
#6     2  NA    NA
#7     3  13    15
#8     4  NA    NA
One function I wrote reads as follows:
CountPerGroup <- function(group) {
  data.set <- subset(x,group %in% group)
  vect <- vector()
  for (i in 1:length(group)) {
    vect[i] <- sum(complete.cases(data.set))
  }
  output <- data.frame(cbind(group,count=vect))
  return(output)
}
The result of
CountPerGroup(2:1)
is
  group count
1     2     4
2     1     4
Unfortunately, this is wrong. Instead the outcome should look like
  group count
1     2     1
2     1     4
What am I missing? How can I tell R to count complete.cases per group? Thank you very much for any help on this!
Upvotes: 3
Views: 4227
Reputation: 1268
I just had the same problem and found an easier solution:
library(data.table)
x <- data.table(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
# Number of complete rows (no NA in age or speed) per group
x[, sum(complete.cases(.SD)), by = group]
Upvotes: 0
Reputation: 10253
Something like this should do the trick if you wish to keep your function-based approach:
x <- data.frame(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
CountPerGroup <- function(x, groups) {
  data.set <- subset(x, group %in% groups)
  ans <- sapply(split(data.set, data.set$group),
                function(y) sum(complete.cases(y)))
  return(data.frame(group = names(ans), count = unname(ans)))
}
CountPerGroup(x, 1:2)
#  group count
#1     1     2
#2     2     0
This is correct as far as I can count, but it does not agree with your suggested outcome.
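As a side note on why the function in the question returns 4 for every group: this appears to come down to how subset() resolves names. Below is a minimal sketch, assuming the same x as above (rebuilt with rep() so the snippet runs on its own). The helpers broken and fixed are hypothetical names used only for illustration. Inside subset(), group %in% group compares the column with itself, so no rows are dropped, and the loop in the question then sums complete.cases() over the whole data set on every iteration.

```r
x <- data.frame(group = rep(1:4, times = 2),
                age   = c(4:1, 11, NA, 13, NA),
                speed = rep(c(12, NA, 15, NA), times = 2))

# Hypothetical helpers, for illustration only.
# The argument shares its name with the column, so subset() finds the
# column on both sides of %in% and keeps every row.
broken <- function(group) nrow(subset(x, group %in% group))

# A differently named argument is not a column of x, so it is looked up
# in the calling function instead.
fixed <- function(groups) nrow(subset(x, group %in% groups))

broken(2:1)  # 8: nothing was filtered out
fixed(2:1)   # 4: only the rows belonging to groups 1 and 2
```

Renaming the argument to groups, as in the function above, avoids the clash.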
EDIT
It seems that you want the number of non-NA values instead, returned in the order in which the groups were requested. Use this function instead:
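CountPerGroup2 <- function(x, groups) {
  data.set <- subset(x, group %in% groups)
  # Count non-NA cells in every column except the grouping column
  ans <- sapply(split(data.set, data.set$group),
                function(y) sum(!is.na(y[, names(y) != "group"])))
  # Index by name, not by position, so any set of groups keeps its order
  ans <- ans[as.character(groups)]
  return(data.frame(group = names(ans), count = unname(ans)))
}
CountPerGroup2(x, 2:1)
#  group count
#1     2     1
#2     1     4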
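To make the difference between the two counts explicit, here is a minimal base R sketch, assuming the same x as above (rebuilt with rep() so it runs on its own). The names value_cols, complete_rows and non_na_values are only illustrative.

```r
x <- data.frame(group = rep(1:4, times = 2),
                age   = c(4:1, 11, NA, 13, NA),
                speed = rep(c(12, NA, 15, NA), times = 2))

# Complete cases: rows per group with no NA in any column
complete_rows <- tapply(complete.cases(x), x$group, sum)
complete_rows   # groups 1 to 4: 2 0 2 0

# Non-NA values: individual non-missing cells per group,
# counted over every column except the grouping column
value_cols <- setdiff(names(x), "group")
non_na_values <- tapply(rowSums(!is.na(x[value_cols])), x$group, sum)
non_na_values   # groups 1 to 4: 4 1 4 1
```

The first result matches CountPerGroup, the second matches CountPerGroup2 and the outcome requested in the question.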
Upvotes: 3
Reputation: 2318
If you are just looking for a way to get the full count of non-NA values per group, you could use something like:
library(plyr)
x <- data.frame(group = c(1:4),
                age = c(4:1, c(11, NA, 13, NA)),
                speed = c(12, NA, 15, NA))
counts <- ddply(x, "group", summarize, count = sum(!is.na(c(age, speed))))
##   group count
## 1     1     4
## 2     2     1
## 3     3     4
## 4     4     1
You do miss out on having a function that lets you query a subset of the groups, but you get a one-line way to calculate the full solution.
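If you do want to query a subset of the groups, one possible wrapper is sketched below. It is written in base R rather than plyr, so it needs no extra package. Note that count_non_na and its group_col argument are hypothetical names, not part of any library, and the sketch assumes every column other than the grouping column should be counted.

```r
x <- data.frame(group = rep(1:4, times = 2),
                age   = c(4:1, 11, NA, 13, NA),
                speed = rep(c(12, NA, 15, NA), times = 2))

# Hypothetical helper: non-NA cells per group, for the requested groups only
count_non_na <- function(data, groups, group_col = "group") {
  value_cols <- setdiff(names(data), group_col)
  # Non-NA cells in each row, then summed within each group
  per_row <- rowSums(!is.na(data[value_cols]))
  totals <- tapply(per_row, data[[group_col]], sum)
  # Look up by name so the result follows the order of `groups`;
  # a group that does not occur in the data yields NA
  counts <- totals[as.character(groups)]
  data.frame(group = groups, count = as.vector(counts))
}

count_non_na(x, 2:1)
#   group count
# 1     2     1
# 2     1     4
```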
Upvotes: 2
Reputation: 31181
Here is a way with data.table:
library(data.table)
library(functional)
countPerGroup = function(x, vec)
{
  dt = data.table(x)
  d1 = setkey(dt, group)[group %in% vec]
  d2 = d1[, lapply(.SD, Compose(Negate(is.na), sum)), by = group]
  transform(d2, count = age + speed, speed = NULL, age = NULL)
}
countPerGroup(x, 1:2)
#   group count
#1:     1     4
#2:     2     1
countPerGroup(x, c(1,2))
#   group count
#1:     1     4
#2:     2     1
If your data.table has a large number of rows, this is particularly efficient!
Upvotes: 0