EVN
EVN

Reputation: 31

How to count, in R, the number of records (i.e., rows, cells) in a column belonging to a combination of treatments in the same dataframe?

This is my first question in this forum and I have limited experience with R, so my apologies if the question is unclear in some way or if it is too basic.

I have a dataframe called values that consists of a sample number column, two factor variables (H and W), and several number columns (named as number intervals after a cut), as follows:

sample  H   W   (12.95,13]  (13,13.05]  (13.05,13.1]    (13.1,13.15]
130 bg  d   0   0   0   0
131 bg  d   0   0   0   0
132 bg  d   0   0   0   0
133 x   i   0   0   0   0
134 x   i   0   0   0   0
135 x   i   0   0   0   0
136 x   i   0   0   0   0
137 x   i   0   0   0   0
138 x   i   0   0   0   0
139 x   i   0   0   0   0
140 x   i   0   0   0   0
141 x   i   0   35947.65    0   0
142 x   i   0   0   0   0
143 x   i   0   0   0   0
144 x   i   0   0   0   0
145 x   i   0   0   0   73709.67
146 x   i   0   0   0   0
147 x   i   21756.63    0   32362.41    0
148 x   i   0   0   0   0
149 x   i   0   0   0   0
150 x   i   0   0   0   0
151 x   i   0   0   0   0
152 x   c   0   0   0   0
153 x   c   0   0   0   0
154 x   c   0   0   0   0
155 x   c   0   0   0   32578.03

I need to count how many rows in each of the number columns for each treatment combination and sample number have values greater than 0. I have tried the aggregate, count and sum functions but have had no success so far.

Could someone help me out with this?

Thank you!

Upvotes: 3

Views: 625

Answers (4)

user3652621
user3652621

Reputation: 3634

With data table (and assuming df is your data frame):

library(data.table)
setDT(df)[`colname`>0, .N, by=list(H, W, sample)]

or

setDT(df)[`colname`>0, .N, by=list(H, W)]

if you don't care about sample.

Where you should replace colname with the name of the particular column you're looking at. This would be easier for me to check if you provided a reproducible example.

Upvotes: 2

SabDeM
SabDeM

Reputation: 7190

Maybe I didn't get it (my solution seems to be very simple), but I just apply on rows the sum of where columns are all different from 0. The output is a numeric vector with length equal to the number of row of you data where:

  • 0 means there is no one column with a value other than 0
  • 1 means there is at least one column with a value greater than 0 and so on...

     apply(!df[, 4:7] == 0, 1, sum)
    
    [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 2 0 0 0 0 0 0 0 1
    

Upvotes: 0

C8H10N4O2
C8H10N4O2

Reputation: 19005

#replicable example
set.seed(123)
values <- data.frame(sample=1:100,
                     a=rep(1,100),
                     b=rep(c(1,2),50),
                     v1=rbinom(100,1,.1) * runif(100),
                     v2=rbinom(100,1,.1) * runif(100),
                     v3=rbinom(100,1,.1) * runif(100)
                     )

aggregate(cbind(v1, v2, v3) ~ a + b, # apply fcn to LHS grouped by RHS
          data=values,              
          FUN=function(x) sum(x>0)  # sum of TRUE v>0's is count of v's greater than 0 
          )
#   a b v1 v2 v3
# 1 1 1  4  4  7
# 2 1 2  3  6  2

Upvotes: 0

donshikin
donshikin

Reputation: 1503

An imperfect solution using plyr (I am pretty sure dplyr package can do even better job, but I am less familiar with it)

The downside is that sums have to be calculated for each column separately - if there are 3 or 4 of them, it is OK, but for 100 intervals that wouldn't be viable.

##Generate fake data with 3 samples, 2 factors 3 levels each 
##and 3 observations per combination
df <- expand.grid(sample = letters[1:3], 
                  f1 = paste0('x', 1:3), 
                  f2 = paste0('y', 1:3))
df <- rbind(df, df, df)
nums <- matrix(rnorm(4*nrow(df)), ncol = 4)
colnames(nums) <- paste0('val_', 1:4)
nums[nums < 1] <- 0
df <- cbind(df, nums)

##Summarize
require(plyr)
ddply(df, .(sample, f1, f2), summarize, 
           sum_1 = sum(val_1 > 0),
           sum_2 = sum(val_2 > 0))

Upvotes: 0

Related Questions