Reputation: 39
Hello fellow Overflowers,
the goal is to process certain steps of data manipulation on a fairy big dataset. In a first step, certain variables, which represent different cases of a certain information, shall be aggregated for each case. There are always 5 variables to aggregate.
Right now, the dataset looks like this:
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 ... xyz5 A B C
case1 3 4 7 9 6 21 13 4 1 7 8
case2 9 12 8 17 25 31 7 2 7 6
case3 5 3 11 10 32 19 13 5 1 6 8
...
It should somehow look like this
mean-a mean-b ...mean-xyz A B C
case1 5,8 17 6,4 1 7 8
case2 9,6 24,3 8,3 2 7 6
case3 7,25 21,3 7 1 6 8
...
I'm not sure if building a function or using the across
function from the dplyr
package is the right way to do it, since there are about 2000 variables which need to be aggregated.
Any help will be greatly appreciated.
Thanks a lot in advance!
Upvotes: 1
Views: 46
Reputation: 21938
You can also use the following solution:
library(dplyr)
library(stringr)
library(purrr)
# First we extract the unique letter values of column names
letters <- unique(str_remove(names(df), "\\d"))
[1] "a" "b"
# Then we iterate over each unique values and extract the columns that contain that unique letter
letters %>%
map(~ df %>%
select(contains(.x)) %>%
rowwise() %>%
mutate("mean_{.x}" := mean(c_across(contains(.x)), na.rm = TRUE))) %>%
bind_cols() %>%
relocate(contains("mean"), .after = last_col())
# A tibble: 3 x 12
# Rowwise:
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 mean_a mean_b
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 4 7 9 6 21 13 7 8 4 5.8 10.6
2 9 12 8 17 25 31 4 2 2 7 14.2 9.2
3 5 3 11 10 32 19 13 2 2 5 12.2 8.2
Data
df <- tribble(
~a1, ~a2, ~a3, ~a4, ~a5, ~b1, ~b2, ~b3, ~b4, ~b5,
3, 4, 7, 9, 6, 21, 13, 7, 8, 4,
9, 12, 8, 17, 25, 31, 4, 2, 2, 7,
5, 3, 11, 10, 32, 19, 13, 2, 2, 5
)
Upvotes: 1
Reputation: 971
# toy data
library(data.table)
m <- matrix(1:30, ncol = 10)
colnames(m) <- c(paste0('a', 1:5), paste0('b', 1:5))
d <- data.table(m)
d
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5
1: 1 4 7 10 13 16 19 22 25 28
2: 2 5 8 11 14 17 20 23 26 29
3: 3 6 9 12 15 18 21 24 27 30
You can determine first the groups you want to aggregate.
groups <- split(colnames(d), gsub("\\d", "", colnames(d)))
groups
$a
[1] "a1" "a2" "a3" "a4" "a5"
$b
[1] "b1" "b2" "b3" "b4" "b5"
Afterwards you simply calculate the mean of each group.
> d[,lapply(groups, function(i) {rowMeans(d[, i, with = FALSE])})]
a b
1: 7 22
2: 8 23
3: 9 24
Upvotes: 1