stats_noob
stats_noob

Reputation: 5897

R: can dplyr calculate percents by group?

I am using the dplyr library in R.

I created the following dataset:

library(dplyr)
 #create data
 a = rnorm(100,100,10)
 b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )

#create frame
 train_data = data.frame(a,b,group)
train_data$group = as.factor(train_data$group)
 

From here, I want to make a new variable called "diff" which records if variable "b" is bigger than variable "a":

train_data$diff = ifelse(train_data$b > train_data$a,1,0)

Now, I want to make a new variable ("perc") in the "train_data" table, which calculates:

e.g.

Using another stackoverflow post, (Compute "percent complete" within subgroups using dplyr in R?) I tried to implement this:

final_table = data.frame(train_data %>% group_by(group) %>% mutate(perc = diff/max(diff)))

But this is not giving me the desired output:

head(final_table)

               a         b group diff perc
1   107.19028 117.37028     D    1             1
2   105.34165  87.96513     A    0             0
3   120.21911  94.30301     C    0             0
4    98.06001 104.82173     D    1             1
5   104.54841  90.00205     B    0             0
6    90.77172  79.31384     D    0             0
7    96.22783  88.60185     D    0             0
8   113.67500  87.28380     B    0             0
9    96.82708  89.51343     C    0             0
10  115.38720 100.79550     C    0             0
11  105.30922  80.55969     C    0             0
12  114.93315  95.78172     B    0             0
13  105.20058 109.66729     C    1             1

For example, row 11 and row 13 both have "group = c", but different values of the "perc" variable. Furthermore, it doesn't seem like percentages are being calculated here.

Can someone please show me how to fix this?

Note: Is it also possible to create a table with 4 rows in which the summaries are provided? I think the Count = n() command can be used for this?

E.g.

Group  Number of Rows    Perc
a            20           0.6
b            20           0.7
c            50           0.9
d            10           0.24

Or a general summary (i.e. in the whole table, what is the percentage of rows where the "diff" variable is 1?):

d = sum(train_data$diff) / count(train_data$diff)

Thanks

Upvotes: 1

Views: 303

Answers (1)

Zhiqiang Wang
Zhiqiang Wang

Reputation: 6769

Please let me know if I misunderstood your questions:

library(dplyr)

#create data
a = rnorm(100,100,10)
b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )

#create frame
train_data = data.frame(a,b,group)


# Question 1
train_data %>% 
  group_by(group) %>% 
  mutate(
    percent = sum(a>b)/n()
  )
#> # A tibble: 100 x 4
#> # Groups:   group [4]
#>        a     b group percent
#>    <dbl> <dbl> <chr>   <dbl>
#>  1  95.0  88.9 B       0.429
#>  2  96.4  95.1 A       0.35 
#>  3 102.  110.  A       0.35 
#>  4  97.4  96.2 A       0.35 
#>  5  90.7  92.7 A       0.35 
#>  6  92.0 105.  B       0.429
#>  7  93.8  85.1 A       0.35 
#>  8 101.  102.  B       0.429
#>  9  92.0  99.1 A       0.35 
#> 10  77.6  87.8 B       0.429
#> # ... with 90 more rows

# Question 2
train_data %>% 
  group_by(group) %>% 
  summarize(
    rows= n(),
    percent = sum(a>b)/n()
  )
#> # A tibble: 4 x 3
#>   group  rows percent
#>   <chr> <int>   <dbl>
#> 1 A        60   0.35 
#> 2 B        21   0.429
#> 3 C         8   0.375
#> 4 D        11   0.364

Created on 2021-07-02 by the reprex package (v2.0.0)

Upvotes: 2

Related Questions