ShanZhengYang

Reputation: 17621

R data.table: How to sum variables by group based on a condition?

Let's say I have the following R data.table (though I'm happy to work with base R, data.frame as well)

library(data.table)

dt = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), Frequency=c(10,15,5,2,14,20,3), times = c(0, 0, 0, 3, 3, 1, 0))

> dt
   Category Frequency times
1:    First        10     0
2:    First        15     0
3:    First         5     0
4:   Second         2     3
5:    Third        14     3
6:    Third        20     1
7:   Second         3     0

If I wished to sum the Frequencies by Category, I would use the following:

dt[, sum(Frequency), by = Category]

However, suppose I want to sum Frequency by Category only for rows where times is non-zero and not NA.

How would one make this sum conditional on the values of a separate column?

EDIT: apologies for the obvious question. A quick addition: what if the elements of a column are strings?

e.g.

> dt
   Category Frequency times
1:    First       ten     0
2:    First       ten     0
3:    First      five     0
4:   Second      five     3
5:    Third      five     3
6:    Third      five     1
7:   Second       ten     0

sum() will not count how often ten versus five occurs.

Upvotes: 2

Views: 16433

Answers (3)

Prasan Karunarathna

Reputation: 385

You can use rowsum() for this.

rowsum

Give Column Sums of a Matrix or Data Frame, Based on a Grouping Variable

Compute column sums across rows of a numeric matrix-like object for each level of a grouping variable. rowsum is generic, with a method for data frames and a default method for vectors and matrices.

Keywords: manip

Usage

rowsum(x, group, reorder = TRUE, ...)

S3 method for data.frame

rowsum(x, group, reorder = TRUE, na.rm = FALSE, ...)

S3 method for default

rowsum(x, group, reorder = TRUE, na.rm = FALSE, ...)

Arguments

x

a matrix, data frame or vector of numeric data. Missing values are allowed. A numeric vector will be treated as a column vector.

group

a vector or factor giving the grouping, with one element per row of x. Missing values will be treated as another group and a warning will be given.

reorder

if TRUE, then the result will be in order of sort(unique(group)), if FALSE, it will be in the order that groups were encountered.

na.rm

logical (TRUE or FALSE). Should NA (including NaN) values be discarded?

...

other arguments to be passed to or from methods

Details

The default is to reorder the rows to agree with tapply as in the example below. Reordering should not add noticeably to the time except when there are very many distinct values of group and x has few columns.

The original function was written by Terry Therneau, but this is a new implementation using hashing that is much faster for large matrices.

To sum over all the rows of a matrix (i.e., a single group) use colSums, which should be even faster.

For integer arguments, over/underflow in forming the sum results in NA.

Value

A matrix or data frame containing the sums. There will be one row per unique value of
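Applied to the question, a minimal sketch might look like the following. It assumes the sample data from the question held in a plain data.frame (the names df, keep and sums are illustrative only), and it filters the rows first because rowsum() itself has no condition argument.

```r
# Sample data from the question, as a base R data.frame
df <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  times     = c(0, 0, 0, 3, 3, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only rows where times is non-zero and not NA
keep <- !is.na(df$times) & df$times != 0

# Sum Frequency per Category over the kept rows;
# the result is a one-column matrix with one row per group
sums <- rowsum(df$Frequency[keep], group = df$Category[keep])
sums
```

With this data only the Second and Third categories have rows that pass the filter, so the result has two rows.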

Upvotes: 1

Andrew Brēza

Reputation: 8317

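You can use bracket subsetting to keep only the rows where times is positive and not NA, then run your grouping operation. which() returns only the positions where the condition is TRUE, so NA values are dropped; use times != 0 instead if negative values can occur.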

dt[which(dt$times > 0)][, sum(Frequency), by = Category]
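To see how which() treats NA and negative values, here is a small sketch on a hypothetical variant of the data (dt2 and its values are made up for illustration and are not from the question):

```r
library(data.table)

# Hypothetical data with an NA and a negative value in times
dt2 <- data.table(
  Category  = c("First", "First", "Second", "Second", "Third"),
  Frequency = c(10, 15, 2, 3, 14),
  times     = c(0, NA, 3, -1, 1)
)

# which() keeps only positions where the condition is TRUE, so the NA row is dropped;
# times > 0 also drops the row with the negative value
pos <- dt2[which(times > 0), sum(Frequency), by = Category]

# times != 0 keeps the negative value as well, while the NA row is still dropped
nonzero <- dt2[which(times != 0), sum(Frequency), by = Category]

pos
nonzero
```

The two results differ only for Second, where the row with times = -1 is counted in nonzero but not in pos.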

Upvotes: 1

s_baldur

Reputation: 33498

Remember the logic of data.table: dt[i, j, by], that is, take dt, subset rows using i, then calculate j grouped by by.

dt[times != 0 & !is.na(times), sum(Frequency), by = Category]
   Category V1
1:   Second  2
2:    Third 34
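The same dt[i, j, by] pattern can be extended to the string case from the question's edit. This is a minimal sketch, assuming the goal is to count how often each string occurs per Category among the rows that pass the filter; dt_str and counts are illustrative names, and .N is data.table's built-in row count per group.

```r
library(data.table)

# String version of the data from the question's edit
dt_str <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c("ten", "ten", "five", "five", "five", "five", "ten"),
  times     = c(0, 0, 0, 3, 3, 1, 0)
)

# Filter rows in i, count rows in j with .N, and group by both columns
counts <- dt_str[times != 0 & !is.na(times), .N, by = .(Category, Frequency)]
counts
```

Grouping by both Category and Frequency gives one row per combination that actually occurs, with the count in column N.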

Upvotes: 5
