Operating between columns and classifying values per group in R

Question

I am trying to compute percentages grouped by one variable.

For this I used sapply to get the percentage of each column relative to another one, but I don't know how to group these values by type (another variable).

x <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
                "type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
x
  A B C type yes
1 0 0 1    x   0
2 0 1 0    x   0
3 1 0 1    x   1
4 1 1 1    y   1
5 1 0 0    y   0
6 1 1 0    y   1
7 1 1 1    x   1

I need to obtain the following value (a percentage): sum(A==1 & yes==1) / sum(A==1), and for this I use the following code:

result <- as.data.frame(sapply(x[,1:3],
                             function(i) (sum(i & x$yes)/sum(i))*100))
result
  sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A                                                          80
B                                                          75
C                                                          75

Now I need the same calculation, but taking the variable "type" into account. That is, I want the same percentage broken down by type. So my expected table is:

   type   sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A  x      40
A  y      40
B  x      25
B  y      50
C  x      50
C  y      25

In the example you can see that, for each letter, the percentages add up to the value obtained in the first result; here they are just broken down by type. Thanks a lot.

JDG · Accepted Answer

You can do the following using data.table:

Code

library(data.table)
setDT(df)  # convert df (defined under "Data" below) to a data.table by reference
cols = c('A', 'B', 'C')

mat = df[yes == 1, lapply(.SD, function(x){

  100 * sum(x)/df[, lapply(.SD, sum), .SDcols = cols][[substitute(x)]]

  # Here, the numerator is the sum of x over the rows where yes == 1 (within
  # the current type group), for x in columns A, B, C.
  # The denominator is sum(x) over all rows, for x in columns A, B, C.
  # The reason why we need substitute(x) is that df[, lapply(.SD, sum)]
  # generates a list of column sums, i.e. list(A = sum(A), B = sum(B), ...).
  # Hence, for each x we must subset that list by column name using [[substitute(x)]].
  # Ultimately, the operation equals sum(x where yes == 1 and type == group) / sum(x) for A, B, C.

}), .(type), .SDcols = cols] 

# '.(type)' simply means that we apply this for each type group,
# i.e. once for x and once for y, for each of the columns A, B, C.
# The dot is just shorthand for 'list()'.
# .SDcols selects the columns that the lapply statement is applied to.

Result

> mat
   type  A  B  C
1:    x 40 25 50
2:    y 40 50 25
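
If looking up the denominator through substitute(x) feels fragile, roughly the same table can be sketched by computing the column totals once and pairing each column with its total through Map. The names dt, totals and mat2 are placeholders introduced for this illustration, and the data is repeated so the snippet runs on its own. Treat it as one possible variant, not the only way to do it.

```r
library(data.table)

# Same data as in the question, repeated so the snippet is self-contained
dt <- data.table(A = c(0,0,1,1,1,1,1), B = c(0,1,0,1,0,1,1), C = c(1,0,1,1,0,0,1),
                 type = c("x","x","x","y","y","y","x"), yes = c(0,0,1,1,0,1,1))
cols <- c("A", "B", "C")

# Denominators: overall sum of each column, computed once (a one-row data.table)
totals <- dt[, lapply(.SD, sum), .SDcols = cols]

# Numerators: per-type column sums over the rows where yes == 1.
# Map walks .SD and totals in parallel, so each column is divided by its own total.
mat2 <- dt[yes == 1,
           Map(function(x, total) 100 * sum(x) / total, .SD, totals),
           by = type, .SDcols = cols]
mat2
```

Computing totals outside the grouped call also avoids recomputing the column sums for every column in every group.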

Long format (your example)

> melt(mat, id.vars = "type")
   type variable value
1:    x        A    40
2:    y        A    40
3:    x        B    25
4:    y        B    50
5:    x        C    50
6:    y        C    25
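
As a cross-check that needs no packages, the same percentages can be sketched in base R, in the spirit of the sapply call from the question. The names d, pct and long are made up for this example, and the data is repeated so the snippet stands alone.

```r
# Same data as in the question, kept as a plain data.frame
d <- data.frame(A = c(0,0,1,1,1,1,1), B = c(0,1,0,1,0,1,1), C = c(1,0,1,1,0,0,1),
                type = c("x","x","x","y","y","y","x"), yes = c(0,0,1,1,0,1,1))
cols <- c("A", "B", "C")

# For each column: count, per type, the rows where both the column and 'yes' are 1,
# then divide by the overall column total (not the per-type total)
pct <- sapply(d[cols], function(col) 100 * tapply(col & d$yes, d$type, sum) / sum(col))
pct   # matrix with one row per type and one column per letter

# Long format: one row per (type, letter) pair
long <- as.data.frame(as.table(pct), responseName = "value")
names(long)[1:2] <- c("type", "variable")
long
```

Because each column is divided by its overall total, the per-type values for a letter add up to the ungrouped percentage from the question (for example 40 + 40 = 80 for A).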

Data

df <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
                "type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
