Sparky_47
Sparky_47

Reputation: 181

Adding percentages calculated by group_by to all rows in that group?

Example Code

mtcars %>%
  group_by(am, gear) %>%
  summarise (n = n()) %>%
  mutate(percent_pos = n / sum(n))

This results in:


# A tibble: 4 x 4
# Groups:   am [2]
     am  gear     n percent_pos
    <dbl> <dbl> <int>       <dbl>
1     0     3    15       0.789
2     0     4     4       0.211
3     1     4     8       0.615
4     1     5     5       0.385

I would like to add the value of 0.789 of the percent_pos column to all cars that are am = 0 via a new column in the mtcars dataframe with the label percentage_positive

Although in this example I can strictly refer to am = 0 or am =1 (using an if statement etc.), in my real dataset I will have hundreds of dates that this will apply to, making it hard to manually code each reference.

I was hoping to use the group_by function of dplyr to apply the value across all of the rows that are grouped by am, but I am having trouble with the mutate function here.

Upvotes: 1

Views: 73

Answers (3)

Justin Landis
Justin Landis

Reputation: 2071

If you wish to only have the maximum percent recorded per each am try the following

max.per <- mtcars %>%
   group_by(am, gear) %>%
   summarise (n = n()) %>%
   mutate(percent_pos = n / sum(n)) %>%
   mutate(percent_pos = max(percent_pos)) %>%
   distinct(am,gear, percent_pos)

# A tibble: 4 x 3
# Groups:   am [2]
     am  gear percent_pos
  <dbl> <dbl>       <dbl>
1     0     3       0.789
2     0     4       0.789
3     1     4       0.615
4     1     5       0.615


mtc <- merge(mtcars, max.per, by = c("am","gear"))

Upvotes: 2

hendrikvanb
hendrikvanb

Reputation: 459

I'm not sure I fully understand what you mean by the following:

I would like to add the value of 0.789 of the percent_pos column to all cars that are am = 0 via a new column in the mtcars dataframe with the label percentage_positive

The problem is that you've calculated multiple distinct percent_pos values for am == 0 (for each value of am, in fact). It therefore seems somewhat arbitrary to emphasize only the first value. Maybe this is exactly what you're after, but that is not clear from the information provided. If, on the other hand, your question is actually:

How do I map the calculated percent_pos value corresponding to each unique combination of am and gear back into the original data?

Then the following simple left_join operation should get you there:

mtcars %>%
  left_join(
    mtcars %>%
      group_by(am, gear) %>%
      summarise (n = n()) %>%
      mutate(percent_pos = n / sum(n)),
    by = c('am', 'gear')
  )

The first/last 4 rows of the resultant tibble look as follows:

    mpg cyl disp  hp drat    wt  qsec vs am gear carb  n percent_pos
1  21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  8   0.6153846
2  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  8   0.6153846
3  22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  8   0.6153846
4  21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 15   0.7894737

29 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4  5   0.3846154
30 19.7   6  145 175 3.62 2.770 15.50  0  1    5    6  5   0.3846154
31 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8  5   0.3846154
31 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2  8   0.6153846

This approach should be able to scale to an arbitrary number of grouping variables.

Upvotes: 2

Peter Hahn
Peter Hahn

Reputation: 158

I am not sure if I understand your question, but maybe mutate_if could be a solution. condition am==0 Peter

Upvotes: 0

Related Questions