JORIS

Reputation: 59

sum up rows based on row.names and condition in col.names -- R

df <- data.frame(
  row.names        = c('1s.u1','1s.u2','2s.u1','2s.u2','6s.u1'),
  fjri_deu_klcea   = c('0','0','0','15','23'),
  hfue_klcea       = c('2','2','0','156','45'),
  dji_dhi_ghcea_jk = c('456','0','0','15','15'),
  jdi_jdi_ghcea    = c('1','2','3','4','100'),
  gz7_jfu_dcea_jdi = c('5','6','3','7','56')
)

df
      fjri_deu_klcea hfue_klcea dji_dhi_ghcea_jk jdi_jdi_ghcea gz7_jfu_dcea_jdi
1s.u1              0          2              456             1                5
1s.u2              0          2                0             2                6
2s.u1              0          0                0             3                3
2s.u2             15        156               15             4                7
6s.u1             23         45               15           100               56

I want to sum up df based on the cea part of the column names, so that all columns sharing the same cea part are added together within each row. The result should look like this:

      klcea ghcea dcea
1s.u1     2   457    5
1s.u2     2     2    6
2s.u1     0     3    3
2s.u2   171    19    7
6s.u1    68   115   56

My first idea was to create a new column called cea that holds the cea name, and then to sum based on row.names and the respective cea with something like with(df, ave(cea, row.names(df), FUN = sum))

I do not know how to generate the new column based on a pattern in a string. I guess grepl is useful, but I could not come up with anything that works. I tried df$cea <- df[grepl(colnames(df),'cea'),], which is wrong.

Upvotes: 0

Views: 784

Answers (2)

Karthik S

Reputation: 11584

Using dplyr:

> df %>% rowwise() %>% mutate(klcea = sum(c_across(ends_with('klcea'))), 
+                             ghcea = sum(c_across(contains('ghcea'))),
+                             dcea = sum(c_across(contains('dcea')))) %>% 
+                     select(klcea, ghcea, dcea)
# A tibble: 5 x 3
# Rowwise: 
  klcea ghcea  dcea
  <dbl> <dbl> <dbl>
1     2   457     5
2     2     2     6
3     0     3     3
4   171    19     7
5    68   115    56

If you wish to retain row names (rownames_to_column and column_to_rownames come from the tibble package):

> df %>% rownames_to_column('rn') %>% rowwise() %>% mutate(klcea = sum(c_across(ends_with('klcea'))), 
+                             ghcea = sum(c_across(contains('ghcea'))),
+                             dcea = sum(c_across(contains('dcea')))) %>% 
+                     select(klcea, ghcea, dcea, rn) %>% column_to_rownames('rn')
      klcea ghcea dcea
1s.u1     2   457    5
1s.u2     2     2    6
2s.u1     0     3    3
2s.u2   171    19    7
6s.u1    68   115   56
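
One caveat when running the above against the data frame exactly as posted in the question: its values are quoted, so the columns are character (or factor, before R 4.0) and sum() will not accept them. A minimal sketch of a conversion step to run first, assuming every column holds only numeric-looking strings:

```r
# Data as posted in the question: every value is a quoted string.
df <- data.frame(
  row.names        = c('1s.u1','1s.u2','2s.u1','2s.u2','6s.u1'),
  fjri_deu_klcea   = c('0','0','0','15','23'),
  hfue_klcea       = c('2','2','0','156','45'),
  dji_dhi_ghcea_jk = c('456','0','0','15','15'),
  jdi_jdi_ghcea    = c('1','2','3','4','100'),
  gz7_jfu_dcea_jdi = c('5','6','3','7','56')
)

# Convert every column to numeric. Assigning into df[] keeps the
# data.frame structure, including row and column names.
# as.character() first so the conversion is also correct for factors.
df[] <- lapply(df, function(x) as.numeric(as.character(x)))

str(df)
```

After this step the columns are numeric and the pipelines above can sum them.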

Upvotes: 1

Ronak Shah

Reputation: 388972

Using base R, you can extract the "cea" part from each column name and pass it to split.default, which splits the data frame into groups of columns. rowSums then sums each group row by row.

sapply(split.default(df, sub('.*_(.*cea).*', '\\1', names(df))), rowSums)

#      dcea ghcea klcea
#1s.u1    5   457     2
#1s.u2    6     2     2
#2s.u1    3     3     0
#2s.u2    7    19   171
#6s.u1   56   115    68

where the sub() call returns:

sub('.*_(.*cea).*', '\\1', names(df))
#[1] "klcea" "klcea" "ghcea" "ghcea" "dcea" 
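
The one-liner can be unpacked into its steps, which also makes it easy to put the columns back in the order the question asks for. This is only a sketch: the intermediate names key, groups and res are made up for illustration, and the data is entered as numbers rather than the quoted strings used in the question.

```r
# Same data as in the question, entered as numbers (an assumption; the
# question quotes the values, which makes the columns character).
df <- data.frame(
  row.names        = c('1s.u1', '1s.u2', '2s.u1', '2s.u2', '6s.u1'),
  fjri_deu_klcea   = c(0, 0, 0, 15, 23),
  hfue_klcea       = c(2, 2, 0, 156, 45),
  dji_dhi_ghcea_jk = c(456, 0, 0, 15, 15),
  jdi_jdi_ghcea    = c(1, 2, 3, 4, 100),
  gz7_jfu_dcea_jdi = c(5, 6, 3, 7, 56)
)

# Step 1: one group label per column.
key <- sub('.*_(.*cea).*', '\\1', names(df))

# Step 2: split the columns (not the rows) into one data frame per label.
groups <- split.default(df, key)

# Step 3: sum each group across its columns, row by row.
res <- sapply(groups, rowSums)

# split.default() orders the groups alphabetically by label; indexing by
# unique(key) restores the order in which the labels first appear.
res <- res[, unique(key)]
res
```

Note that res is a matrix; wrap it in as.data.frame() if a data frame is needed.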

Upvotes: 1
