Silverclaw
Silverclaw

Reputation: 1396

How can I keep columns when grouping/summarizing?

So, the problem for this question is, I cannot post actual code because of an agreement I had to sign and I'm new at R and probably unable to explain that well, , but maybe someone can help me anyway...

Let's say I have some data:

A   B    C   D
F1  6.6  10  10
F1  3.1  10  10
A1  1.0  20  10
B1  3.4  20  20

So, for every A, the C and D values are the same. But I want to use dplyr to find Bmean like so:

A    Bmean   C    D
F1   4,85    10  10
A1   1.0     20  10
B1   3.4     20  20

How would I do that? My idea was to use something like

dplyr::group_by(A) %>% dplyr::summarize(Bmean = mean(B))

but C and D seem to disappear after this operation. Would it make sense to group_by all columns I want to keep? Or how would that work?

Just to clarify, I would like to use the dplyr syntax, since it's part of a bigger operation, if possible.

Upvotes: 6

Views: 15243

Answers (6)

千木郷
千木郷

Reputation: 1807

With dplyr newer than version 1.1.0, a new function called reframe() was introduced, though still labeled experimental.

Answering this question, reframe() could be helpful as the following code snippet:

tibble::tribble(
    ~A,    ~B,  ~C, ~D,
    "F1",  6.6, 10, 10,
    "F1",  3.1, 10, 10,
    "A1",  1.0, 20, 10,
    "B1",  3.4, 20, 20,
) |> dplyr::group_by(A) |> 
  dplyr::reframe(
    Bmean = mean(B),
    dplyr::across(C:D, .fns = ~unique(.x)),
  )

Upvotes: 0

Florian
Florian

Reputation: 142

I would like to add an awnser which specifically solves the problem with the use of dplyr. While I'm sure, there are more elegant ways of doing this, the following proposal can retain columns with additional descriptive variables in a summarized/aggregated data frame. Also if this is not the case the code will not work protecting you from mistakes in bigger dataframes.

library(dplyr)
library(tibble)

df <- tribble(
  ~A  , ~B , ~c , ~D ,
  "F1", 6.6, 10 , 10 ,
  "F1", 3.1, 10 , 10 ,
  "A1", 1.0, 20 , 10 ,
  "B1", 3.4, 20 , 20
)

The following code drops the columns C and D

df %>%
  group_by(A) %>%
  summarise(Bmean = mean(B)) 

This code keeps the columns C and D. Note that this only works, if there is the same variable in each row of the group. But since the variables should be retained and not have an influence in thr grouping behaviour this should be the case anyways.

df %>%
  group_by(A) %>%
  summarise(Bmean = mean(B),
            C = unique(C),
            D = unique(D))

Update:

in fact you can also include the groups in the group_by expression, if the grouping levels are not "smaller" than the grouped variable

Group1:
  A ,  B ,  C ,  D
"F1", 6.6, 10 , 10 
"F1", 3.1, 10 , 10 
Group2:
"A1", 1.0, 20 , 10 
Group3:
"B1", 3.4, 20 , 20

Note that columns C and D maintain the same value within each group. This means they could safely be used in the grouping expression and thus be retained.

So in your case also this would work:

group_by(A,C,D)

Upvotes: 6

Mick
Mick

Reputation: 121

OK, this question is a bit old but in case someone still ends up here like me, here's a solution with the same package (dplyr). Works especially well if you have many columns to keep unchanged.

df %>%
  group_by(A) %>%
  summarise(Bmean = mean(B),
            across(C:D, .fns = ~.x))

It's applying a function (that does nothing) to columns C to D (note: I have uppercase 'C' here instead of the lowercase 'c' in the original post)

Upvotes: 5

akrun
akrun

Reputation: 886938

We can use data.table

library(data.table)
setDT(df1)[,.(Bmean = mean(B)) , .(A, C, D)]
#     A  C  D Bmean
#1: F1 10 10  4.85
#2: A1 20 10  1.00
#3: B1 20 20  3.40

Upvotes: 3

user2100721
user2100721

Reputation: 3587

You can do this using base R

aggregate(data=df1,B~.,FUN = mean)

Upvotes: 4

Pj_
Pj_

Reputation: 824

You can simply use something like this:

aggregate(cbind(B, C, D) ~ A, df, mean)

Here df is your data frame with columns A, B, C and D.

Upvotes: 3

Related Questions