Ashley
Ashley

Reputation: 413

Count number of non-NA values greater than 0 by group

Here is an example of a data set df:

Name    L1     L2     L3    L4
Carl    1       NA     0     2
Carl    0       1     4     1 
Joe     3       0     3     1
Joe     2       2     1     0

I would like to create a function that would be able to tally up the number of values in columns L2, L3, and L4 that are greater than 0 as a function of some name. For example:

someFunction(Joe)
# 4

However, I have some NAs in my columns.

I have tried using complete.cases to remove the NAs but I do not want to remove the entire row. I want to use aggregate, however, I am not exactly sure how. Thanks for your help.

Upvotes: 1

Views: 2140

Answers (4)

Zheyuan Li
Zheyuan Li

Reputation: 73385

We can use

colSums(df[c("L2", "L3", "L4")] > 0, na.rm = TRUE)

Or you may want a sum per person:

m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)

#     L2 L3 L4
#Carl  1  1  2
#Joe   1  2  1

There is something fun here. df[c("L2", "L3", "L4")] > 0 is a logical matrix (with NA):

  • Although colSums can work with it without trouble, rowsum can not. So a fix is to add a 0 to this matrix to cast it to a 0-1 numerical matrix;
  • when adding this 0, we must do (df[c("L2", "L3", "L4")] > 0) + 0 not df[c("L2", "L3", "L4")] > 0 + 0. The operation precedence in R means + is prior to >. Have a try on this toy example:

    5 > 4 + 0  ## FALSE
    (5 > 4) + 0  ## 1
    

    So we want a bracket to evaluate > first, then +.

If you want the result to be a data frame, just cast the resulting matrix into a data frame by:

data.frame(m)

Follow-up

People stop responding, because your specific question on getting a function is less interesting than getting the summary dataset.

Well, if you still take my approach, I would define such function as:

extract <- function (person) {
  m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
  rowSums(m)[[person]]
  }

Then you can call

extract("Joe")
# 4
extract("Carl")
# 4

Note, this is obviously not the most efficient way to write such a function. Because if you only want to extract the sum for one person, there is no need to proceed all data. We can do:

extract2 <- function (person) {
  ## subset data
  sub <- subset(df, df$Name == person, select = c("L2", "L3", "L4"))
  ## get sum
  sum(sub > 0, na.rm = TRUE)
  }

Then you can call

extract2("Joe")
# 4
extract2("Carl")
# 4

Upvotes: 4

akrun
akrun

Reputation: 887571

We can use aggregate with rowSums to get the output

aggregate(cbind(Total=rowSums(df[3:5]>0, 
              na.rm=TRUE))~cbind(Name=df$Name), FUN = sum)
#  Name Total
#1 Carl     4
#2  Joe     4

Or using data.table, convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'Name' and specifying the select column in .SDcols, unlist the Subset of Data.table (.SD), convert it to a logical vector (>0) and get the sum of the TRUE values to create the summarised 'Total' column

library(data.table)
setDT(df)[, .(Total = sum(unlist(.SD)>0, na.rm = TRUE)), Name, .SDcols = L2:L4]
#   Name Total
#1: Carl     4
#2:  Joe     4

Or another option is with dplyr/tidyr. We select the columns of interest, gather into 'long' format, filter only the elements that are greater than 0, then grouped by 'Name' get the total number of rows (n())

library(dplyr)
library(tidyr)
df %>% 
    select(-L1) %>% 
    gather(Var, Val, -Name) %>% 
    filter(Val>0) %>% 
    group_by(Name) %>%
    summarise(Total = n())
# A tibble: 2 × 2
#   Name Total
#  <chr> <int>
#1  Carl     4
#2   Joe     4

Upvotes: 2

alistaire
alistaire

Reputation: 43354

With aggregate, you'll need to set both the na.rm parameter of sum, plus the na.action parameter of aggregate itself. After that, it's easy to add the three columns:

df_sums <- aggregate(. ~ Name, df, FUN = function(x) {
    sum(x > 0, na.rm = TRUE)
}, na.action = na.pass)

df_sums$sum_L2_L3_L4 <- with(df_sums, L1 + L2 + L3)

df_sums
##   Name L1 L2 L3 L4 sum_L2_L3_L4
## 1 Carl  1  1  1  2            4
## 2  Joe  2  1  2  1            4

or in dplyr,

library(dplyr)

df %>% group_by(Name) %>% 
    summarise_all(funs(sum(. > 0, na.rm = TRUE))) %>% 
    mutate(sum_L2_L3_L4 = L2 + L3 + L4)

## # A tibble: 2 × 6
##     Name    L1    L2    L3    L4 sum_L2_L3_L4
##   <fctr> <int> <int> <int> <int>        <int>
## 1   Carl     1     1     1     2            4
## 2    Joe     2     1     2     1            4

or directly,

df %>% group_by(Name) %>% summarise(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE))
## # A tibble: 2 × 2
##     Name   sum
##   <fctr> <int>
## 1   Carl     4
## 2    Joe     4

or data.table

library(data.table)

setDT(df)[, lapply(.SD, function(x){sum(x > 0, na.rm = TRUE)}), by = Name
    ][, sum_L2_L3_L4 := L2 + L3 + L4, by = Name][]

##    Name L1 L2 L3 L4 sum_L2_L3_L4
## 1: Carl  1  1  1  2            4
## 2:  Joe  2  1  2  1            4

or directly,

setDT(df)[, .(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE)), by = Name]

##    Name sum
## 1: Carl   4
## 2:  Joe   4

Upvotes: 3

Silence Dogood
Silence Dogood

Reputation: 3597

With plyr you could do:

library(plyr)

nonZeroDF = ddply(DF[,-2],"Name",.fun = function(x) 
data.frame(nonZeroObs=sum((x[,-1]) >0,na.rm=TRUE) ))

#  Name nonZeroObs
#1 Carl          4
#2  Joe          4

Upvotes: 1

Related Questions