Stoof
Stoof

Reputation: 747

Get frequency across columns in R

I have data in a data frame in this format:

  grp1 grp2 grp3 grp4 result
1    0    1    0    0      1
2    1    0    0    0      0
3    0    0    0    1      1
4    0    0    0    1      1
5    1    0    0    0      0
6    0    1    0    0      1
.
.
.

Which can be generated with

set.seed(13)

groups <- c("grp1", "grp2", "grp3", "grp4", "result")

# Randomly assign each to group and a result
x <- do.call(rbind, lapply(1:50, function(x) c(sample(c(1,0,0,0), 4), sample(0:1, 1))))
df <- data.frame(x)
colnames(df) <- groups

My goal is to have the data formatted like:

  group      freq
1  grp1 0.5625000
2  grp2 0.5000000
3  grp3 0.6250000
4  grp4 0.2857143

Where the frequency is the percentage of each group that have the result.

My attempt so far using dplyr:

library(dplyr)

df %>% 
  group_by(grp1, grp2, grp3, grp4, result) %>% 
  summarize(n = n()) %>% 
  mutate(freq = n / sum(n)) %>%
  select(-n) %>%
  filter(result == 1)

results in

  grp1 grp2 grp3 grp4 result      freq
1    0    0    0    1      1 0.5625000
2    0    0    1    0      1 0.5000000
3    0    1    0    0      1 0.6250000
4    1    0    0    0      1 0.2857143

Upvotes: 3

Views: 390

Answers (5)

User7598
User7598

Reputation: 1678

You can also use apply:

> freq=apply(df,2,function(x){sum(x==1 & df$result==1)/sum(x)})
> data.frame(freq)
#             freq
# grp1   0.2857143
# grp2   0.6250000
# grp3   0.5000000
# grp4   0.5625000
# result 1.0000000

As suggested by @akrun, you could also do:

summarise_each(df,funs( sum(.==1 & df$result==1)/sum(.))) %>% t()

In this case, apply seems to provide the fastest solution:

akrun=function(df)    {summarise_each(df,funs( sum(.==1 & df$result==1)/sum(.))) %>% t()}
user7598=function(df) {apply(df,2,function(x){sum(x==1 & df$result==1)/sum(x)})}
David=function(df)    {melt(setDT(df), "result")[, .(freq = sum(value[result == 1])/sum(value)), by = variable]}
Gregor=function(df)   {df %>% tidyr::gather(key = group, value = group_choice, grp1:grp4) %>% group_by(group) %>% filter(group_choice == 1) %>% summarize(freq = mean(group_choice == result))}

# SPEED TESTS
set.seed(5)
microbenchmark(akrun(df), Gregor(df),user7598(df),David(df))
Unit: microseconds
         expr       min         lq       mean    median         uq       max neval cld
    akrun(df)  9645.860 10509.3940 12690.5538 10848.248 12315.4020 98239.948   100   c
   Gregor(df) 10319.888 11405.6060 12512.9027 11685.120 12237.1120 26211.999   100   c
 user7598(df)   423.662   491.7045   630.8143   563.958   629.8315  2027.243   100   a  
    David(df)  2115.610  2273.5525  2622.7699  2348.005  2475.2295 15491.534   100   b 

Note changes based on comments by OP in @Gregor's answer.

Upvotes: 3

bgoldst
bgoldst

Reputation: 35314

I think colSums() is appropriate here:

rci <- which(names(df)=='result');
data.frame(group=names(df[-rci]),freq=unname(colSums(df[-rci]==1&df[,rci]==1)/colSums(df[-rci])));
##   group      freq
## 1  grp1 0.2857143
## 2  grp2 0.6250000
## 3  grp3 0.5000000
## 4  grp4 0.5625000

Upvotes: 0

David Arenburg
David Arenburg

Reputation: 92282

Here's a data.table attempt

library(data.table)
melt(setDT(df), "result")[, .(freq = sum(value[result == 1])/sum(value)), by = variable]
#    variable      freq
# 1:     grp1 0.2857143
# 2:     grp2 0.6250000
# 3:     grp3 0.5000000
# 4:     grp4 0.5625000

Upvotes: 6

JFu
JFu

Reputation: 121

If I understand correctly, you're looking to know the percentage of each group with a "1", conditional on "Result" being 1. If that's the case, then you can use the apply() function to sum over columns, then divide by the length of the column. You can apply the constraint that "Result" must equal 1 by specifying in the data frame.

Notice that in the data frame specification below, I tell R to use apply only for the first four columns, since the "Result" column doesn't need to be incorporated as part of the frequency calculation.

namely: df[conditional statement, c(1:4)]

result <- data.frame(apply(df[df$result == 1, c(1:4)], 2,sum)/apply(df[df$result==1, c(1:4) ], 2, length))
colnames(result)<- c("freq")

This produces results in the format below

      freq
grp1 0.1818182
grp2 0.1818182
grp3 0.3636364
grp4 0.2727273

Upvotes: 0

Gregor Thomas
Gregor Thomas

Reputation: 145765

By "the frequency is the percentage of each group that have the result" I assume you mean the percentage of rows where each group is equal to the result.

df %>% 
    tidyr::gather(key = group, value = group_choice, grp1:grp4) %>%
    group_by(group) %>%
    filter(group_choice == 1) %>%
    summarize(freq = mean(group_choice == result))

# Source: local data frame [4 x 2]
# 
#   group      freq
# 1  grp1 0.2857143
# 2  grp2 0.6250000
# 3  grp3 0.5000000
# 4  grp4 0.5625000

Upvotes: 4

Related Questions