Reputation: 13

Calculate column-wise proportions of different elements with different denominators

I have a 60 column table with values in each column like: YES, NO, NA, NP.

    A   B   C
1  YES NO  NO
2  NA  NA  NA
3 YES NO  YES
4  NP  NP  NP

I need to calculate multiple proportions, each with a distinct denominator:

Example:
Percentage of YES against YES and NO's (YES / (YES + NO))
Percentage of NP against all except NA's (NP / (YES + NO + NP))
Percentage of NA against all (NA / rows)

Desired result from the data frame above:

%YES.A  %NP.A  %NA.A  %YES.B
  100%    33%    25%    0%

What I have tried so far:
Trying to use dplyr with mutate but need to create 7 aditional columns for each of the 60 columns. Four for each value and three for each percentage.

Tried revalue, but same problem of too many columns

mydata$A.NO <- revalue(mydata$A, c("NO" = 1))  
mydata$A.YES <- revalue(mydata$A,c("YES" = 1))

Hope someone could guide me to a better solution or any library I should read.

Upvotes: 1

Answers (3)

Jaap

Reputation: 83215

The first step I would do, is reshape the data into long format and then calculate the percentages by group:

library(dplyr)
library(tidyr)

mydf %>% 
  gather(key, val) %>%
  group_by(key) %>% 
  summarise(pYes = 100*sum(val=="YES",na.rm=TRUE)/sum(val %in% c("YES","NO"),na.rm=TRUE),
            pNP = 100*sum(val=="NP",na.rm=TRUE)/sum(val %in% c("YES","NO","NP"),na.rm=TRUE),
            pNA = 100*sum(is.na(val))/n())

which gives:

Source: local data frame [3 x 4]

    key  pYes      pNP   pNA
  (chr) (dbl)    (dbl) (dbl)
1     A   100 33.33333    25
2     B     0 33.33333    25
3     C    50 33.33333    25

You can reshape this result into the same format as below by adding gather(percentage, val, -key) as the last step.

You could do this by summarising first and then reshaping into long format:

mydf %>% 
  summarise_each(funs(pYes = sum(.=="YES",na.rm=TRUE)/sum(.%in% c("YES","NO"),na.rm=TRUE),
                      pNP = 100*sum(.=="NP",na.rm=TRUE)/sum(. %in% c("YES","NO","NP"),na.rm=TRUE),
                      pNA = 100*sum(is.na(.))/n())) %>%
  gather(key, val) %>%
  separate(key, c("column","percentage"), sep="_")

which gives:

  column percentage       val
1      A       pYes 100.00000
2      B       pYes   0.00000
3      C       pYes  50.00000
4      A        pNP  33.33333
5      B        pNP  33.33333
6      C        pNP  33.33333
7      A        pNA  25.00000
8      B        pNA  25.00000
9      C        pNA  25.00000

Upvotes: 2

Erol

Reputation: 31

Converting the data frame into a data table makes it about 50% faster.

dt <- data.table(df)
dt[, sapply(.SD, myfun)]

df is the original data frame and myfun is the following function provided by Heroka:

myfun <- function(x){
  res <- c(YES=sum(x=="YES",na.rm=T)/sum(x %in% c("YES","NO"),na.rm=T),
                    NP=sum(x=="NP",na.rm=T)/length(na.omit(x)))
  res
}

Upvotes: 1

Heroka

Reputation: 13139

Here is a solution for you in base-R. As you only want the summary-measures, there's no need to create new columns. We just create a new summary data-object.

First, we write a custum function to calculate everything we need for one vector (column). I've done two of your examples, but you can easily expand:

myfun <- function(x){
  res <- c(YES=sum(x=="YES",na.rm=T)/sum(x %in% c("YES","NO"),na.rm=T),
                    NP=sum(x=="NP",na.rm=T)/length(na.omit(x)))
  res
}

Then we simply apply this function to all our columns using lapply:

res <- lapply(dat, myfun)

This gives us a list of vectors (you can easily change things around in the function; do you want a list, vector, dataframe or data.table?)

We can combine them:

do.call(rbind,res)
  YES        NP
A 1.0 0.3333333
B 0.0 0.3333333
C 0.5 0.3333333

Upvotes: 1

Calculate column-wise proportions of different elements with different denominators

Answers (3)

Related Questions