Reputation: 63
I'm new to R and could use some help with the following problem:
I have a rather large dataset in data.table format, and I want to loop over a group of variables that are indexed by a number (say, x_1, x_2, ..., x_n). To keep things simple, let's say I want to take the mean of each variable for each value of a variable y and store the results as m_1, m_2, ..., m_n in my data.table.
Can someone suggest efficient code that does this? n, the number of x_* variables, is too large for me to do this one by one.
Thanks
Upvotes: 3
Views: 47
Reputation: 145765
Very simply and efficiently:
ind = 1:5 # replace 5 with your n
for (i in ind) {
  # add column m_i holding the overall (ungrouped) mean of x_i; df must be a data.table
  set(df, j = paste("m", i, sep = "_"), value = mean(df[[paste("x", i, sep = "_")]]))
}
set is usually extremely fast. It doesn't allow grouped operations, so if you need to group by another column, you'll need a different approach, for example:
ind = 1:5
df[, paste("m", ind, sep = "_") := lapply(.SD, mean), .SDcols = paste("x", ind, sep = "_")]
In the above, you can add the by argument as usual (for example, by = y) to compute the means within each group.
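To make the grouped case concrete, here is a minimal sketch of the same := call with by added. The toy table dt, its values, and the helper names x_cols and m_cols are made up for illustration; only the grouping column y and the x_*/m_* naming come from the question.

```r
library(data.table)

# Hypothetical toy data: a grouping column y and three numeric columns x_1..x_3
dt <- data.table(
  y   = rep(c("a", "b"), each = 3),
  x_1 = c(1, 2, 3, 4, 5, 6),
  x_2 = c(10, 20, 30, 40, 50, 60),
  x_3 = c(0, 0, 3, 1, 1, 1)
)

ind    <- 1:3                          # replace 3 with your n
x_cols <- paste("x", ind, sep = "_")   # "x_1" "x_2" "x_3"
m_cols <- paste("m", ind, sep = "_")   # "m_1" "m_2" "m_3"

# Add the group means as new columns by reference
# (each row gets the mean of its own y group)
dt[, (m_cols) := lapply(.SD, mean), by = y, .SDcols = x_cols]

# Alternatively, collapse to one row per value of y
means <- dt[, setNames(lapply(.SD, mean), m_cols), by = y, .SDcols = x_cols]
```

The first call keeps every row and repeats the group mean within each group; the second returns a summary table with one row per group. Which one fits depends on whether the m_* columns are needed alongside the original data.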
Upvotes: 5
Reputation: 4279
This approach works with dplyr; not sure how to do the same with data.table.
library(dplyr)
df <- tibble(group = factor(rep(letters[1:4], 5)),
x_1 = rnorm(20, mean = 10),
x_2 = rnorm(20, mean = 20),
x_3 = rnorm(20, mean = 30))
group_by(df, group) %>%
summarize_all(.funs = c(mean, sd))
# # A tibble: 4 x 7
# group x_1_fn1 x_2_fn1 x_3_fn1 x_1_fn2 x_2_fn2 x_3_fn2
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 10.1 19.9 30.1 0.684 0.792 0.461
# 2 b 9.99 19.2 30.2 1.14 1.20 0.960
# 3 c 9.32 20.3 30.0 0.762 0.721 1.56
# 4 d 9.89 19.9 29.9 1.29 1.39 0.589
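In recent dplyr releases (1.0.0 and later) summarize_all is superseded by across, which also gives readable column names. The following is a rough sketch of the equivalent call, assuming that dplyr version; the small data frame is made up for illustration so the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data with a grouping column and two x_* columns
df <- tibble(
  group = rep(c("a", "b"), each = 2),
  x_1   = c(1, 3, 5, 7),
  x_2   = c(10, 30, 50, 70)
)

# Apply mean and sd to every column whose name starts with "x_", per group.
# With a named list of functions, output columns are named <column>_<function>.
res <- df %>%
  group_by(group) %>%
  summarise(across(starts_with("x_"), list(mean = mean, sd = sd)),
            .groups = "drop")
```

Here res has the columns group, x_1_mean, x_1_sd, x_2_mean and x_2_sd, instead of the generic fn1/fn2 suffixes shown above.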
Upvotes: 2