KT_1

Reputation: 8474

Calculate percentages of a binary variable BY another variable in R

I want to summarise the percentage of people that have been treated BY region.

I have created a dummy dataset for this purpose:

id <- 1:1000
region <- rep(c("A","B","C","D","E"), each=200)
# the 8-value pattern 1,1,1,1,2,2,2,2 is recycled by data.frame() to fill all 1000 rows
treatment <- rep(1:2, each=4)
d <- data.frame(id,region,treatment)

How would I find out (a) the total number of people in each region (I presume I would use length for this purpose) and (b) the percentage of people who had treatment 1 (as opposed to 2) BY region?

I will have NAs for some of the IDs, so if this could be incorporated in the code from the outset, that would be appreciated.

I have used ddply in the past to summarise a continuous variable (e.g. the mean) but am struggling when using a factor variable.

Any help would be greatly appreciated.

Upvotes: 2

Views: 7716

Answers (4)

Sam Dickson

Reputation: 5239

For completeness, here's how you can do it using ddply() from plyr:

library(plyr)
ddply(d[!is.na(d$id),],.(region),summarize,
      N = length(region),
      prop=mean(treatment==1))
#   region   N prop
# 1      A 200  0.5
# 2      B 200  0.5
# 3      C 200  0.5
# 4      D 200  0.5
# 5      E 200  0.5

This assumes that you want to deal with NA values in id by removing those observations.
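To see what that filtering does when some id values really are missing, here is a small base R sketch (no plyr needed). The positions of the missing ids are made up for illustration, since the question does not say where the NAs occur, and the names kept and res are just labels used here.

```r
# Rebuild the dummy data from the question.
d <- data.frame(
  id        = 1:1000,
  region    = rep(c("A", "B", "C", "D", "E"), each = 200),
  treatment = rep(rep(1:2, each = 4), length.out = 1000)
)

# Assumption for illustration: the first ten ids (all in region A) are missing.
d$id[1:10] <- NA

# Drop rows with a missing id, as the ddply() call above does.
kept <- d[!is.na(d$id), ]

# Per-region count and proportion on treatment 1.
N    <- tapply(kept$treatment, kept$region, length)
prop <- tapply(kept$treatment == 1, kept$region, mean)
res  <- data.frame(region = names(N), N = as.vector(N), prop = as.vector(prop))
print(res)
```

Region A now reports fewer people than the other regions, and its proportion is computed only over the rows that were kept.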

Upvotes: 0

PavoDive

Reputation: 6496

A dplyr solution:

library(dplyr)
d %>% group_by(region) %>% summarize(NumPat=n(),prop=sum(treatment==1)/n())

Here we group by region and then summarize each group: n() counts the patients in the group, and sum(treatment==1)/n() gives the proportion of those patients who received treatment 1.
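Since the question mentions missing values, it may help to see how the same pipeline behaves when treatment contains NA. The sketch below is an illustration under stated assumptions rather than part of the answer: the NA positions are invented, and the column names n_known, pct_all and pct_known are labels chosen here. Without na.rm = TRUE, any group containing an NA would return NA for the proportion.

```r
library(dplyr)

# Rebuild the dummy data from the question.
d <- data.frame(
  id        = 1:1000,
  region    = rep(c("A", "B", "C", "D", "E"), each = 200),
  treatment = rep(rep(1:2, each = 4), length.out = 1000)
)

# Assumption for illustration: four people in region A have no recorded treatment.
d$treatment[1:4] <- NA

res <- d %>%
  group_by(region) %>%
  summarize(
    NumPat    = n(),                     # everyone in the region
    n_known   = sum(!is.na(treatment)),  # people with a recorded treatment
    # NAs stay in the denominator:
    pct_all   = 100 * sum(treatment == 1, na.rm = TRUE) / n(),
    # NAs are dropped from the denominator:
    pct_known = 100 * mean(treatment == 1, na.rm = TRUE)
  )
print(res)
```

Which denominator is appropriate depends on whether people with an unknown treatment should count towards the regional total.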

Upvotes: 4

Heroka

Reputation: 13139

You could also use data.table:

library(data.table)

# proportion on treatment 1, as asked in the question
setDT(d)[,.(.N,prop=sum(treatment==1)/.N),
         by=region]
#    region   N prop
# 1:      A 200  0.5
# 2:      B 200  0.5
# 3:      C 200  0.5
# 4:      D 200  0.5
# 5:      E 200  0.5

Upvotes: 2

swolf

Reputation: 1143

If I understand the question correctly, this can be done very easily (and fast!) with table and prop.table:

prop.table(table(d$treatment, d$region))

This gives you each cell as a proportion of the grand total. If you want row- or column-wise proportions, use the margin argument of prop.table:

prop.table(table(d$treatment, d$region), margin = 2) # column-wise: proportions within each region
prop.table(table(d$treatment, d$region), margin = 1) # row-wise: proportions within each treatment
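With treatment in rows and region in columns, both parts of the question can be read off the same table. A minimal sketch, rebuilding the question's dummy data; the object names tab, totals and pct_treat1 are just labels used here.

```r
# Rebuild the dummy data from the question.
d <- data.frame(
  id        = 1:1000,
  region    = rep(c("A", "B", "C", "D", "E"), each = 200),
  treatment = rep(rep(1:2, each = 4), length.out = 1000)
)

# Rows are treatments, columns are regions.
# table() drops NAs by default; pass useNA = "ifany" to count them as well.
tab <- table(d$treatment, d$region)

# (a) number of people per region = column totals
totals <- colSums(tab)

# (b) percentage on treatment 1 within each region
pct_treat1 <- 100 * prop.table(tab, margin = 2)["1", ]

print(totals)
print(pct_treat1)
```

Multiplying by 100 turns the proportions into the percentages the question asks for.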

Upvotes: 1
