Reputation: 2464
I have a data frame where one column is a matrix, not a vector, and I'd like to split it into multiple vector-valued columns.
For example, this
set.seed(101)
Df <- data.frame(x = runif(100),
y = round(runif(100)))
descriptives <- function(arg) c(mean(arg), median(arg), sd(arg))
results <- aggregate(x ~ y, data=Df, descriptives)
gives me
> results
y x.1 x.2 x.3
1 0 0.4710224 0.4280053 0.3206661
2 1 0.5769064 0.6220120 0.2683387
which is a 2x2 dataframe with the second column being a 2x3 matrix:
> dim(results)
[1] 2 2
> results[,2]
[,1] [,2] [,3]
[1,] 0.4710224 0.4280053 0.3206661
[2,] 0.5769064 0.6220120 0.2683387
I'd like to split that matrix into three column vectors.
I could do
results <- cbind(results$y, as.data.frame(results[,2]))
names(results) <- c('y', 'mean', 'median', 'sd')
> results
y mean median sd
1 0 0.4710224 0.4280053 0.3206661
2 1 0.5769064 0.6220120 0.2683387
But is there a more direct and less hacky way? I am thinking of something like tidyr's separate.
Upvotes: 2
Views: 1734
Reputation: 38500
A cleaner solution in base R is to use tapply, which returns a vector of per-group results, instead of aggregate, and to use lapply to run through each calculation:
# collect desired statistical calculations in a named list
descriptives <- list(mean=mean, median=median, sd=sd)
# get data.frame of values
dfStats <- data.frame(lapply(descriptives, function(f) tapply(Df$x, Df$y, f)))
dfStats
mean median sd
0 0.4710224 0.4280053 0.3206661
1 0.5769064 0.6220120 0.2683387
Here, lapply runs through the list of functions and applies tapply to get the group calculations. The resulting list is converted to a data frame with data.frame. Note that the values of y end up as the row names here, but you can easily pull them out with row.names if desired:
dfStats$y <- row.names(dfStats)
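If y should be an ordinary first column rather than the row names, the steps above can be combined as in the following sketch. It assumes the grouping values are numeric (as they are in the question's Df); row names are always character, so they are converted back with as.numeric.

```r
set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))

# collect the desired statistics in a named list
descriptives <- list(mean = mean, median = median, sd = sd)
dfStats <- data.frame(lapply(descriptives, function(f) tapply(Df$x, Df$y, f)))

# row names are character; convert them back because y is numeric in this example
dfStats <- cbind(y = as.numeric(row.names(dfStats)), dfStats)

# drop the now-redundant row names
row.names(dfStats) <- NULL
str(dfStats)
```

For non-numeric grouping variables, the as.numeric call would need to be dropped or replaced with a suitable conversion.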
Upvotes: 2
Reputation: 193527
One approach to fix the problem (though, as others have pointed out, the problem can be avoided) is to use do.call(data.frame, ...).
Example:
do.call(data.frame, results)
# y x.1 x.2 x.3
# 1 0 0.4710224 0.4280053 0.3206661
# 2 1 0.5769064 0.6220120 0.2683387
str(.Last.value)
# 'data.frame': 2 obs. of 4 variables:
# $ y : num 0 1
# $ x.1: num 0.471 0.577
# $ x.2: num 0.428 0.622
# $ x.3: num 0.321 0.268
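This works because a data frame is a list of columns, and data.frame expands a matrix argument into one column per matrix column. As a small sketch building on that, if the summary function returns a named vector, those names carry through to the flattened columns; the object name flat is only illustrative.

```r
set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))

# naming the elements gives the resulting matrix column names
descriptives <- function(arg) c(mean = mean(arg), median = median(arg), sd = sd(arg))
results <- aggregate(x ~ y, data = Df, descriptives)

# each matrix column becomes its own column, prefixed with "x."
flat <- do.call(data.frame, results)
names(flat)

# optionally strip the "x." prefix
names(flat) <- sub("^x\\.", "", names(flat))
names(flat)
```

Stripping the prefix is only safe when a single variable was aggregated; with several variables the prefix is what keeps the column names distinct.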
Upvotes: 2
Reputation: 145775
In this case, with a matrix, you can use cbind:
cbind(results[1], results[[2]])
# y 1 2 3
# 1 0 0.4710224 0.4280053 0.3206661
# 2 1 0.5769064 0.6220120 0.2683387
Using [ on the first argument makes sure it is a data.frame, so cbind.data.frame is used and the result is not coerced to a matrix. Using [[ on the second argument makes sure that we are binding to the 2x3 matrix in the second column rather than to a data frame containing that matrix.
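The difference between the two extraction operators can be checked directly. The sketch below rebuilds results from the question and also names the matrix columns before binding, so the output gets readable column names; the names stats and wide are illustrative.

```r
set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))
descriptives <- function(arg) c(mean(arg), median(arg), sd(arg))
results <- aggregate(x ~ y, data = Df, descriptives)

is.data.frame(results[1])   # single bracket keeps the data frame wrapper
is.numeric(results[[1]])    # double bracket extracts the column itself
is.matrix(results[[2]])     # the x column is a matrix

# name the matrix columns, then bind them next to y
stats <- results[[2]]
colnames(stats) <- c("mean", "median", "sd")
wide <- cbind(results[1], stats)
names(wide)
```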
More generally, tidyr::unnest works well with vector or list columns, but the result is in long form; you would have to add an ID column and spread it to get back to wide form.
However, the simplest way may be to use dplyr and summarize_at with a custom .funs argument:
library(dplyr)
# funs() has been deprecated since dplyr 0.8.0; a named list of functions works instead
descrip = list(m = mean, med = median, sd = sd)
Df %>% group_by(y) %>% summarize_at("x", .funs = descrip)
# # A tibble: 2 × 4
# y m med sd
# <dbl> <dbl> <dbl> <dbl>
# 1 0 0.4710224 0.4280053 0.3206661
# 2 1 0.5769064 0.6220120 0.2683387
Upvotes: 2
Reputation: 439
Yes. It is relatively easy with dplyr.
library(dplyr)
results <- Df %>% group_by(y) %>%
summarize(mean = mean(x),
median = median(x),
sd = sd(x))
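With more statistics or more columns, spelling out each summary by hand gets repetitive. A possible variant, assuming dplyr 1.0.0 or later is installed, uses across with a named list of functions; the result name results_across is illustrative.

```r
library(dplyr)

set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))

# across() applies every function in the list to the selected column(s);
# output columns are named <column>_<function> by default
results_across <- Df %>%
  group_by(y) %>%
  summarize(across(x, list(mean = mean, median = median, sd = sd)))

names(results_across)
```

Each output column is an ordinary vector, so no matrix column is created in the first place.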
Upvotes: 1