JohnY
JohnY

Reputation: 21

Group columns and sum values in R

I have a data set which looks like this

Variable    A1  A2  A3  B1  B2  B3  
Item 1  Men 1   3   3   1   2   3
Item 2  Men 1   3   3   1   2   3
Item 1  Men 1   3   3   1   2   3   
Item 3  Men 2   1   2   5   3   3
Item 2  Men 1   3   3   1   2   3

I need to group columns A1, A2, A3 to be A and B1,B2,B3 to be B, thereafter I would like to sum the values.

How do I do this in R ?

Upvotes: 2

Views: 194

Answers (3)

talat
talat

Reputation: 70256

If you only have to do this for a few variables, you could do:

df$A <- rowSums(df[grep("^A\\d+$", names(df))])
df$B <- rowSums(df[grep("^B\\d+$", names(df))])
df[!names(df) %in% grep("^[A-Z]{1}\\d+$", names(df), value = TRUE)]
df
#  Variable Sex A  B
#1   Item_1 Men 7  6
#2   Item_2 Men 7  6
#3   Item_1 Men 7  6
#4   Item_3 Men 5 11
#5   Item_2 Men 7  6

(I called the "MEN" column Sex since it seemed like there were too few column names).

The grep() command I wrote above uses regular expressions. For example, "^A\\d+$" matches those column names that start with "A" followed by any number of digits. The grep command returns the indices where those matches were found (in the column names) and therefore the rowSums will only add up the right columns.

A very similar concept is used in the third line to drop the columns which are of the format: Any single capital letter (A to Z) followed by any number of digits, will be dropped.

Upvotes: 2

akrun
akrun

Reputation: 886998

If you have many groups, you could convert the wide format to long format and then do the necessary calculation. Here is one way using dplyr

library(dplyr)
library(tidyr)

df$indx <- 1:nrow(df)

df %>% 
   gather(Var, Val, -Variable,-indx) %>%
   extract(Var, c('Var1', 'Var2'), '(.)(.)') %>% 
   group_by(Variable, indx, Var1) %>%
   summarise(Val=sum(Val)) %>%
   ungroup() %>%
   spread(Var1, Val) %>% 
   arrange(indx)

#     Variable indx A  B
#1 Item 1  Men    1 7  6
#2 Item 2  Men    2 7  6
#3 Item 1  Men    3 7  6
#4 Item 3  Men    4 5 11
#5 Item 2  Men    5 7  6

Or another option would be to use Map with rowSums

 indx <- sub('\\d+', '', colnames(df)[-1])
 lst <- split(colnames(df)[-1],indx)
 df1 <- df[1]
 Un <- unique(indx)
 df1[Un] <- Map(function(x,y) rowSums(x[y], na.rm=TRUE),
                                                     list(df), lst)
 df1
 #     Variable A  B
 #1 Item 1  Men 7  6
 #2 Item 2  Men 7  6
 #3 Item 1  Men 7  6
 #4 Item 3  Men 5 11
 #5 Item 2  Men 7  6

Or if the columns are ordered ie. A1, A2,A3 followed by B1, B2, B3 etc. and the number of columns per each group is the same. (Change the dim accordingly)

 df1[Un] <- apply(array(as.matrix(df[-1]), 
                              dim=c(nrow(df),3,length(Un))), c(1,3), sum)

data

df <-  structure(list(Variable = c("Item 1  Men", "Item 2  Men", "Item 1  Men", 
"Item 3  Men", "Item 2  Men"), A1 = c(1L, 1L, 1L, 2L, 1L), A2 = c(3L, 
3L, 3L, 1L, 3L), A3 = c(3L, 3L, 3L, 2L, 3L), B1 = c(1L, 1L, 1L, 
5L, 1L), B2 = c(2L, 2L, 2L, 3L, 2L), B3 = c(3L, 3L, 3L, 3L, 3L
)), .Names = c("Variable", "A1", "A2", "A3", "B1", "B2", "B3"
), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 0

alias_paj
alias_paj

Reputation: 158

test.df$A <- test.df$A1 + test.df$A2 + test.df$A3

and the same for B. This code will just create a new variable in your dataframe named A with the calculation.

Upvotes: 0

Related Questions