Valentin_Ștefan
Valentin_Ștefan

Reputation: 6436

Group-by in data.table with choosing first element in multiple columns

Given the data table:

DT <- data.table(C1=1:6,C2=10:15,C3=1:3,C4=c("A","B")) 

   C1 C2 C3 C4
1:  1 10  1  A
2:  2 11  2  B
3:  3 12  3  A
4:  4 13  1  B
5:  5 14  2  A
6:  6 15  3  B

If I want to group by C4, compute the mean on C3 and display the first element in C1, I would do this:

DT[, list(C1=C1[1], AVG_C3=mean(C3), Freq=.N), by=C4]
   C4 C1 AVG_C3 Freq
1:  A  1      2    3
2:  B  2      2    3

If I would like to get the first element in C2, then I add C2[1] in the above line:

DT[, list(C1=C1[1], C2=C2[1], AVG_C3=mean(C3), Freq=.N), by=C4]

How would I tackle this task given the fact that I have hundreds of columns and I wish to see the first element as explained above?

Maybe there is some simple way like working with a vector of desired columns cols <- c("C1","C2"), but I can't figure it out how to assign to cols within DT[]. I had a look at assigning to multiple columns here Assign multiple columns using := in data.table, by group , but I'm not sure how to implement it for this case. I would appreciate your feedback :)

Upvotes: 2

Views: 842

Answers (2)

akrun
akrun

Reputation: 887213

We can do the summary statistics on specified columns by 'C4' and join on the original dataset to get the corresponding rows.

DT[DT[, .(Avg_C3= mean(C3), freq= .N, C1 = C1[1L]) , by = C4], on = c("C4", "C1")]
#   C1 C2 C3 C4 Avg_C3 freq
#1:  1 10  1  A      2    3
#2:  2 11  2  B      2    3

Or an option using dplyr

library(dplyr)
DT %>% 
  group_by(C4) %>%
  mutate(Avg_C3 = mean(C3), freq = n(), C1 = first(C1)) %>% 
  slice(1L)
#      C1    C2    C3    C4 Avg_C3  freq
#   <int> <int> <int> <chr>  <dbl> <int>
#1     1    10     1     A      2     3
#2     2    11     2     B      2     3

Upvotes: 2

akuiper
akuiper

Reputation: 214977

Suppose you want to see the first element of C1, C2 and C3 respectively, you can use the head on the .SD and specify the column names using .SDcols.

cols <- c("C1", "C2", "C3")
DT[, c(head(.SD, 1), list(AVG_C3=mean(C3), Freq=.N)), by=C4, .SDcols = cols]

   C4 C1 C2 C3 AVG_C3 Freq
1:  A  1 10  1      2    3
2:  B  2 11  2      2    3

Upvotes: 4

Related Questions