Reputation: 747
I have data in a data frame in this format:
grp1 grp2 grp3 grp4 result
1 0 1 0 0 1
2 1 0 0 0 0
3 0 0 0 1 1
4 0 0 0 1 1
5 1 0 0 0 0
6 0 1 0 0 1
.
.
.
Which can be generated with
set.seed(13)
groups <- c("grp1", "grp2", "grp3", "grp4", "result")
# Randomly assign each to group and a result
x <- do.call(rbind, lapply(1:50, function(x) c(sample(c(1,0,0,0), 4), sample(0:1, 1))))
df <- data.frame(x)
colnames(df) <- groups
My goal is to have the data formatted like:
group freq
1 grp1 0.5625000
2 grp2 0.5000000
3 grp3 0.6250000
4 grp4 0.2857143
Where the frequency is the percentage of each group that have the result.
My attempt so far using dplyr:
library(dplyr)
df %>%
group_by(grp1, grp2, grp3, grp4, result) %>%
summarize(n = n()) %>%
mutate(freq = n / sum(n)) %>%
select(-n) %>%
filter(result == 1)
results in
grp1 grp2 grp3 grp4 result freq
1 0 0 0 1 1 0.5625000
2 0 0 1 0 1 0.5000000
3 0 1 0 0 1 0.6250000
4 1 0 0 0 1 0.2857143
Upvotes: 3
Views: 390
Reputation: 1678
You can also use apply
:
> freq=apply(df,2,function(x){sum(x==1 & df$result==1)/sum(x)})
> data.frame(freq)
# freq
# grp1 0.2857143
# grp2 0.6250000
# grp3 0.5000000
# grp4 0.5625000
# result 1.0000000
As suggested by @akrun, you could also do:
summarise_each(df,funs( sum(.==1 & df$result==1)/sum(.))) %>% t()
In this case, apply
seems to provide the fastest solution:
akrun=function(df) {summarise_each(df,funs( sum(.==1 & df$result==1)/sum(.))) %>% t()}
user7598=function(df) {apply(df,2,function(x){sum(x==1 & df$result==1)/sum(x)})}
David=function(df) {melt(setDT(df), "result")[, .(freq = sum(value[result == 1])/sum(value)), by = variable]}
Gregor=function(df) {df %>% tidyr::gather(key = group, value = group_choice, grp1:grp4) %>% group_by(group) %>% filter(group_choice == 1) %>% summarize(freq = mean(group_choice == result))}
# SPEED TESTS
set.seed(5)
microbenchmark(akrun(df), Gregor(df),user7598(df),David(df))
Unit: microseconds
expr min lq mean median uq max neval cld
akrun(df) 9645.860 10509.3940 12690.5538 10848.248 12315.4020 98239.948 100 c
Gregor(df) 10319.888 11405.6060 12512.9027 11685.120 12237.1120 26211.999 100 c
user7598(df) 423.662 491.7045 630.8143 563.958 629.8315 2027.243 100 a
David(df) 2115.610 2273.5525 2622.7699 2348.005 2475.2295 15491.534 100 b
Note changes based on comments by OP in @Gregor's answer.
Upvotes: 3
Reputation: 35314
I think colSums()
is appropriate here:
rci <- which(names(df)=='result');
data.frame(group=names(df[-rci]),freq=unname(colSums(df[-rci]==1&df[,rci]==1)/colSums(df[-rci])));
## group freq
## 1 grp1 0.2857143
## 2 grp2 0.6250000
## 3 grp3 0.5000000
## 4 grp4 0.5625000
Upvotes: 0
Reputation: 92282
Here's a data.table
attempt
library(data.table)
melt(setDT(df), "result")[, .(freq = sum(value[result == 1])/sum(value)), by = variable]
# variable freq
# 1: grp1 0.2857143
# 2: grp2 0.6250000
# 3: grp3 0.5000000
# 4: grp4 0.5625000
Upvotes: 6
Reputation: 121
If I understand correctly, you're looking to know the percentage of each group with a "1", conditional on "Result" being 1. If that's the case, then you can use the apply() function to sum over columns, then divide by the length of the column. You can apply the constraint that "Result" must equal 1 by specifying in the data frame.
Notice that in the data frame specification below, I tell R to use apply only for the first four columns, since the "Result" column doesn't need to be incorporated as part of the frequency calculation.
namely: df[conditional statement, c(1:4)]
result <- data.frame(apply(df[df$result == 1, c(1:4)], 2,sum)/apply(df[df$result==1, c(1:4) ], 2, length))
colnames(result)<- c("freq")
This produces results in the format below
freq
grp1 0.1818182
grp2 0.1818182
grp3 0.3636364
grp4 0.2727273
Upvotes: 0
Reputation: 145765
By "the frequency is the percentage of each group that have the result" I assume you mean the percentage of rows where each group is equal to the result.
df %>%
tidyr::gather(key = group, value = group_choice, grp1:grp4) %>%
group_by(group) %>%
filter(group_choice == 1) %>%
summarize(freq = mean(group_choice == result))
# Source: local data frame [4 x 2]
#
# group freq
# 1 grp1 0.2857143
# 2 grp2 0.6250000
# 3 grp3 0.5000000
# 4 grp4 0.5625000
Upvotes: 4