user14375619
user14375619

Reputation: 39

Aggregating variables for cases

Hello fellow Overflowers,

the goal is to process certain steps of data manipulation on a fairy big dataset. In a first step, certain variables, which represent different cases of a certain information, shall be aggregated for each case. There are always 5 variables to aggregate.

Right now, the dataset looks like this:

      a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 ... xyz5 A B C 
case1 3  4  7  9  6  21 13              4    1 7 8 
case2 9  12 8        17 25 31           7    2 7 6
case3 5  3  11 10    32 19 13           5    1 6 8
...

It should somehow look like this

      mean-a  mean-b ...mean-xyz A B C 
case1 5,8     17        6,4      1 7 8 
case2 9,6     24,3      8,3      2 7 6
case3 7,25    21,3      7        1 6 8
...

I'm not sure if building a function or using the acrossfunction from the dplyr package is the right way to do it, since there are about 2000 variables which need to be aggregated.

Any help will be greatly appreciated.

Thanks a lot in advance!

Upvotes: 1

Views: 46

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21938

You can also use the following solution:

library(dplyr)
library(stringr)
library(purrr)

# First we extract the unique letter values of column names
letters <- unique(str_remove(names(df), "\\d"))
[1] "a" "b"   
  

# Then we iterate over each unique values and extract the columns that contain that unique letter

letters %>%
  map(~ df %>% 
        select(contains(.x)) %>% 
        rowwise() %>%
        mutate("mean_{.x}" := mean(c_across(contains(.x)), na.rm = TRUE))) %>%
  bind_cols() %>%
  relocate(contains("mean"), .after = last_col())


# A tibble: 3 x 12
# Rowwise: 
     a1    a2    a3    a4    a5    b1    b2    b3    b4    b5 mean_a mean_b
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     3     4     7     9     6    21    13     7     8     4    5.8   10.6
2     9    12     8    17    25    31     4     2     2     7   14.2    9.2
3     5     3    11    10    32    19    13     2     2     5   12.2    8.2

Data

df <- tribble(
  ~a1, ~a2, ~a3, ~a4, ~a5, ~b1, ~b2, ~b3, ~b4, ~b5,
 3, 4, 7, 9, 6, 21, 13, 7, 8, 4, 
 9, 12, 8, 17, 25, 31, 4, 2, 2, 7,
 5, 3, 11, 10, 32, 19, 13, 2, 2, 5
)

Upvotes: 1

Ben373
Ben373

Reputation: 971

Example Data:

# toy data
library(data.table)
m <- matrix(1:30, ncol = 10)
colnames(m) <- c(paste0('a', 1:5), paste0('b', 1:5))
d <- data.table(m)
d
   a1 a2 a3 a4 a5 b1 b2 b3 b4 b5
1:  1  4  7 10 13 16 19 22 25 28
2:  2  5  8 11 14 17 20 23 26 29
3:  3  6  9 12 15 18 21 24 27 30

Determine Groups:

You can determine first the groups you want to aggregate.

groups <- split(colnames(d), gsub("\\d", "", colnames(d)))
groups
$a
[1] "a1" "a2" "a3" "a4" "a5"

$b
[1] "b1" "b2" "b3" "b4" "b5"

Aggregate

Afterwards you simply calculate the mean of each group.

> d[,lapply(groups, function(i) {rowMeans(d[, i, with = FALSE])})]
   a  b
1: 7 22
2: 8 23
3: 9 24

Upvotes: 1

Related Questions