Reputation: 413
Here is an example of a data set df
:
Name L1 L2 L3 L4
Carl 1 NA 0 2
Carl 0 1 4 1
Joe 3 0 3 1
Joe 2 2 1 0
I would like to create a function that would be able to tally up the number of values in columns L2, L3, and L4 that are greater than 0 as a function of some name. For example:
someFunction(Joe)
# 4
However, I have some NA
s in my columns.
I have tried using complete.cases
to remove the NA
s but I do not want to remove the entire row. I want to use aggregate
, however, I am not exactly sure how. Thanks for your help.
Upvotes: 1
Views: 2140
Reputation: 73385
We can use
colSums(df[c("L2", "L3", "L4")] > 0, na.rm = TRUE)
Or you may want a sum per person:
m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
# L2 L3 L4
#Carl 1 1 2
#Joe 1 2 1
There is something fun here. df[c("L2", "L3", "L4")] > 0
is a logical matrix (with NA
):
colSums
can work with it without trouble, rowsum
can not. So a fix is to add a 0
to this matrix to cast it to a 0-1 numerical matrix;when adding this 0
, we must do (df[c("L2", "L3", "L4")] > 0) + 0
not df[c("L2", "L3", "L4")] > 0 + 0
. The operation precedence in R means +
is prior to >
. Have a try on this toy example:
5 > 4 + 0 ## FALSE
(5 > 4) + 0 ## 1
So we want a bracket to evaluate >
first, then +
.
If you want the result to be a data frame, just cast the resulting matrix into a data frame by:
data.frame(m)
Follow-up
People stop responding, because your specific question on getting a function is less interesting than getting the summary dataset.
Well, if you still take my approach, I would define such function as:
extract <- function (person) {
m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
rowSums(m)[[person]]
}
Then you can call
extract("Joe")
# 4
extract("Carl")
# 4
Note, this is obviously not the most efficient way to write such a function. Because if you only want to extract the sum for one person, there is no need to proceed all data. We can do:
extract2 <- function (person) {
## subset data
sub <- subset(df, df$Name == person, select = c("L2", "L3", "L4"))
## get sum
sum(sub > 0, na.rm = TRUE)
}
Then you can call
extract2("Joe")
# 4
extract2("Carl")
# 4
Upvotes: 4
Reputation: 887571
We can use aggregate
with rowSums
to get the output
aggregate(cbind(Total=rowSums(df[3:5]>0,
na.rm=TRUE))~cbind(Name=df$Name), FUN = sum)
# Name Total
#1 Carl 4
#2 Joe 4
Or using data.table
, convert the 'data.frame' to 'data.table' (setDT(df)
), grouped by 'Name' and specifying the select column in .SDcols
, unlist
the Subset of Data.table (.SD
), convert it to a logical vector (>0
) and get the sum
of the TRUE values to create the summarised 'Total' column
library(data.table)
setDT(df)[, .(Total = sum(unlist(.SD)>0, na.rm = TRUE)), Name, .SDcols = L2:L4]
# Name Total
#1: Carl 4
#2: Joe 4
Or another option is with dplyr/tidyr
. We select
the columns of interest, gather
into 'long' format, filter
only the elements that are greater than 0, then grouped by 'Name' get the total number of rows (n()
)
library(dplyr)
library(tidyr)
df %>%
select(-L1) %>%
gather(Var, Val, -Name) %>%
filter(Val>0) %>%
group_by(Name) %>%
summarise(Total = n())
# A tibble: 2 × 2
# Name Total
# <chr> <int>
#1 Carl 4
#2 Joe 4
Upvotes: 2
Reputation: 43354
With aggregate
, you'll need to set both the na.rm
parameter of sum
, plus the na.action
parameter of aggregate
itself. After that, it's easy to add the three columns:
df_sums <- aggregate(. ~ Name, df, FUN = function(x) {
sum(x > 0, na.rm = TRUE)
}, na.action = na.pass)
df_sums$sum_L2_L3_L4 <- with(df_sums, L1 + L2 + L3)
df_sums
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or in dplyr,
library(dplyr)
df %>% group_by(Name) %>%
summarise_all(funs(sum(. > 0, na.rm = TRUE))) %>%
mutate(sum_L2_L3_L4 = L2 + L3 + L4)
## # A tibble: 2 × 6
## Name L1 L2 L3 L4 sum_L2_L3_L4
## <fctr> <int> <int> <int> <int> <int>
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or directly,
df %>% group_by(Name) %>% summarise(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE))
## # A tibble: 2 × 2
## Name sum
## <fctr> <int>
## 1 Carl 4
## 2 Joe 4
or data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x){sum(x > 0, na.rm = TRUE)}), by = Name
][, sum_L2_L3_L4 := L2 + L3 + L4, by = Name][]
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1: Carl 1 1 1 2 4
## 2: Joe 2 1 2 1 4
or directly,
setDT(df)[, .(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE)), by = Name]
## Name sum
## 1: Carl 4
## 2: Joe 4
Upvotes: 3
Reputation: 3597
With plyr
you could do:
library(plyr)
nonZeroDF = ddply(DF[,-2],"Name",.fun = function(x)
data.frame(nonZeroObs=sum((x[,-1]) >0,na.rm=TRUE) ))
# Name nonZeroObs
#1 Carl 4
#2 Joe 4
Upvotes: 1