Reputation: 13
I have a 60 column table with values in each column like: YES, NO, NA, NP.
A B C
1 YES NO NO
2 NA NA NA
3 YES NO YES
4 NP NP NP
I need to calculate multiple proportions, each with a distinct denominator:
Example:
Percentage of YES against YES and NO's (YES / (YES + NO)
)
Percentage of NP against all except NA's (NP / (YES + NO + NP)
)
Percentage of NA against all (NA / rows
)
Desired result from the data frame above:
%YES.A %NP.A %NA.A %YES.B
100% 33% 25% 0%
What I have tried so far:
Trying to use dplyr
with mutate
but need to create 7 aditional columns for each of the 60 columns. Four for each value and three for each percentage.
Tried revalue
, but same problem of too many columns
mydata$A.NO <- revalue(mydata$A, c("NO" = 1))
mydata$A.YES <- revalue(mydata$A,c("YES" = 1))
Hope someone could guide me to a better solution or any library I should read.
Upvotes: 1
Views: 1284
Reputation: 83215
The first step I would do, is reshape the data into long format and then calculate the percentages by group:
library(dplyr)
library(tidyr)
mydf %>%
gather(key, val) %>%
group_by(key) %>%
summarise(pYes = 100*sum(val=="YES",na.rm=TRUE)/sum(val %in% c("YES","NO"),na.rm=TRUE),
pNP = 100*sum(val=="NP",na.rm=TRUE)/sum(val %in% c("YES","NO","NP"),na.rm=TRUE),
pNA = 100*sum(is.na(val))/n())
which gives:
Source: local data frame [3 x 4]
key pYes pNP pNA
(chr) (dbl) (dbl) (dbl)
1 A 100 33.33333 25
2 B 0 33.33333 25
3 C 50 33.33333 25
You can reshape this result into the same format as below by adding gather(percentage, val, -key)
as the last step.
You could do this by summarising first and then reshaping into long format:
mydf %>%
summarise_each(funs(pYes = sum(.=="YES",na.rm=TRUE)/sum(.%in% c("YES","NO"),na.rm=TRUE),
pNP = 100*sum(.=="NP",na.rm=TRUE)/sum(. %in% c("YES","NO","NP"),na.rm=TRUE),
pNA = 100*sum(is.na(.))/n())) %>%
gather(key, val) %>%
separate(key, c("column","percentage"), sep="_")
which gives:
column percentage val
1 A pYes 100.00000
2 B pYes 0.00000
3 C pYes 50.00000
4 A pNP 33.33333
5 B pNP 33.33333
6 C pNP 33.33333
7 A pNA 25.00000
8 B pNA 25.00000
9 C pNA 25.00000
Upvotes: 2
Reputation: 31
Converting the data frame into a data table makes it about 50% faster.
dt <- data.table(df)
dt[, sapply(.SD, myfun)]
df is the original data frame and myfun is the following function provided by Heroka:
myfun <- function(x){
res <- c(YES=sum(x=="YES",na.rm=T)/sum(x %in% c("YES","NO"),na.rm=T),
NP=sum(x=="NP",na.rm=T)/length(na.omit(x)))
res
}
Upvotes: 1
Reputation: 13139
Here is a solution for you in base-R. As you only want the summary-measures, there's no need to create new columns. We just create a new summary data-object.
First, we write a custum function to calculate everything we need for one vector (column). I've done two of your examples, but you can easily expand:
myfun <- function(x){
res <- c(YES=sum(x=="YES",na.rm=T)/sum(x %in% c("YES","NO"),na.rm=T),
NP=sum(x=="NP",na.rm=T)/length(na.omit(x)))
res
}
Then we simply apply this function to all our columns using lapply:
res <- lapply(dat, myfun)
This gives us a list of vectors (you can easily change things around in the function; do you want a list, vector, dataframe or data.table?)
We can combine them:
do.call(rbind,res)
YES NP
A 1.0 0.3333333
B 0.0 0.3333333
C 0.5 0.3333333
Upvotes: 1