Reputation: 5897
I am using the dplyr library in R.
I created the following dataset:
library(dplyr)
#create data
a = rnorm(100,100,10)
b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )
#create frame
train_data = data.frame(a,b,group)
train_data$group = as.factor(train_data$group)
From here, I want to make a new variable called "diff" which records if variable "b" is bigger than variable "a":
train_data$diff = ifelse(train_data$b > train_data$a,1,0)
Now, I want to make a new variable ("perc") in the "train_data" table, which calculates:
e.g.
Using another stackoverflow post, (Compute "percent complete" within subgroups using dplyr in R?) I tried to implement this:
final_table = data.frame(train_data %>% group_by(group) %>% mutate(perc = diff/max(diff)))
But this is not giving me the desired output:
head(final_table)
a b group diff perc
1 107.19028 117.37028 D 1 1
2 105.34165 87.96513 A 0 0
3 120.21911 94.30301 C 0 0
4 98.06001 104.82173 D 1 1
5 104.54841 90.00205 B 0 0
6 90.77172 79.31384 D 0 0
7 96.22783 88.60185 D 0 0
8 113.67500 87.28380 B 0 0
9 96.82708 89.51343 C 0 0
10 115.38720 100.79550 C 0 0
11 105.30922 80.55969 C 0 0
12 114.93315 95.78172 B 0 0
13 105.20058 109.66729 C 1 1
For example, row 11 and row 13 both have "group = c", but different values of the "perc" variable. Furthermore, it doesn't seem like percentages are being calculated here.
Can someone please show me how to fix this?
Note: Is it also possible to create a table with 4 rows in which the summaries are provided? I think the Count = n()
command can be used for this?
E.g.
Group Number of Rows Perc
a 20 0.6
b 20 0.7
c 50 0.9
d 10 0.24
Or a general summary (i.e. in the whole table, what is the percentage of rows where the "diff" variable is 1?):
d = sum(train_data$diff) / count(train_data$diff)
Thanks
Upvotes: 1
Views: 303
Reputation: 6769
Please let me know if I misunderstood your questions:
library(dplyr)
#create data
a = rnorm(100,100,10)
b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )
#create frame
train_data = data.frame(a,b,group)
# Question 1
train_data %>%
group_by(group) %>%
mutate(
percent = sum(a>b)/n()
)
#> # A tibble: 100 x 4
#> # Groups: group [4]
#> a b group percent
#> <dbl> <dbl> <chr> <dbl>
#> 1 95.0 88.9 B 0.429
#> 2 96.4 95.1 A 0.35
#> 3 102. 110. A 0.35
#> 4 97.4 96.2 A 0.35
#> 5 90.7 92.7 A 0.35
#> 6 92.0 105. B 0.429
#> 7 93.8 85.1 A 0.35
#> 8 101. 102. B 0.429
#> 9 92.0 99.1 A 0.35
#> 10 77.6 87.8 B 0.429
#> # ... with 90 more rows
# Question 2
train_data %>%
group_by(group) %>%
summarize(
rows= n(),
percent = sum(a>b)/n()
)
#> # A tibble: 4 x 3
#> group rows percent
#> <chr> <int> <dbl>
#> 1 A 60 0.35
#> 2 B 21 0.429
#> 3 C 8 0.375
#> 4 D 11 0.364
Created on 2021-07-02 by the reprex package (v2.0.0)
Upvotes: 2