mjandrews

Reputation: 2464

In R, split vector valued column in data frame into multiple columns

I have a dataframe where one column is a matrix, not a vector, and I'd like to split it into multiple vector valued columns.

For example, this

set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))

descriptives <- function(arg) c(mean(arg), median(arg), sd(arg))
results <- aggregate(x ~ y, data=Df, descriptives)

gives me

> results
  y       x.1       x.2       x.3
1 0 0.4710224 0.4280053 0.3206661
2 1 0.5769064 0.6220120 0.2683387

which is a 2x2 dataframe with the second column being a 2x3 matrix:

> dim(results)
[1] 2 2
> results[,2]
          [,1]      [,2]      [,3]
[1,] 0.4710224 0.4280053 0.3206661
[2,] 0.5769064 0.6220120 0.2683387
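The structure described above can be checked directly. This is a small sketch that rebuilds the same `results` object and inspects it; it only uses base R, and the anonymous helper passed to `sapply` is just for illustration.

```r
# Rebuild the example from the question
set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))
descriptives <- function(arg) c(mean(arg), median(arg), sd(arg))
results <- aggregate(x ~ y, data = Df, descriptives)

# First class of each top-level column: x is a matrix, not a plain vector
sapply(results, function(col) class(col)[1])

ncol(results)    # top-level columns of the data frame
ncol(results$x)  # columns inside the matrix column
```

The printed data frame looks like it has four columns only because `print` expands the matrix column for display.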

I'd like to split that matrix into three column vectors.

I could do

results <- cbind(results$y, as.data.frame(results[,2]))
names(results) <- c('y', 'mean', 'median', 'sd')

> results
  y      mean    median        sd
1 0 0.4710224 0.4280053 0.3206661
2 1 0.5769064 0.6220120 0.2683387

But is there a more direct and less hacky way? I am thinking of something like tidyr's separate.

Upvotes: 2

Views: 1734

Answers (4)

lmo

Reputation: 38500

A cleaner solution in base R is to use tapply instead of aggregate. For each statistic, tapply returns a named vector (strictly, a one-dimensional array) of per-group values, and lapply runs through the list of calculations:

# collect desired statistical calculations in a named list
descriptives <- list(mean=mean, median=median, sd=sd)

# get data.frame of values
dfStats <- data.frame(lapply(descriptives, function(f) tapply(Df$x, Df$y, f)))
dfStats
       mean    median        sd
0 0.4710224 0.4280053 0.3206661
1 0.5769064 0.6220120 0.2683387

Here, lapply runs through the list of functions and calls tapply with each one to compute the per-group statistics. data.frame then combines the resulting list into a data frame. Note that the y values end up as the row names, but you can easily pull them into a column with row.names if desired.

dfStats$y <- row.names(dfStats)
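One thing to watch for: `row.names` returns character strings, so the `y` column created this way is character rather than numeric. The following is a minimal sketch of one way to get a numeric `y` as the first column; the name `tidy` is just an illustrative choice.

```r
set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))

# Same approach as above: one tapply call per statistic
descriptives <- list(mean = mean, median = median, sd = sd)
dfStats <- data.frame(lapply(descriptives, function(f) tapply(Df$x, Df$y, f)))

# Convert the group labels back to numeric and put them first
tidy <- data.frame(y = as.numeric(row.names(dfStats)), dfStats)

# Drop the group labels from the row names now that they live in a column
rownames(tidy) <- NULL
str(tidy)
```

The `as.numeric` conversion assumes the grouping variable was numeric to begin with, as it is in this example.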

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

One approach to fix the problem (though, as others have pointed out, the problem can be avoided) is to use do.call(data.frame, ...).

Example:

do.call(data.frame, results)
#   y       x.1       x.2       x.3
# 1 0 0.4710224 0.4280053 0.3206661
# 2 1 0.5769064 0.6220120 0.2683387
str(.Last.value)
# 'data.frame': 2 obs. of  4 variables:
#  $ y  : num  0 1
#  $ x.1: num  0.471 0.577
#  $ x.2: num  0.428 0.622
#  $ x.3: num  0.321 0.268
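If the summary function returns a named vector, the flattened columns pick up those names, which saves renaming by position afterwards. This is a sketch of that variation, not part of the original question's code; the names `flat` and the regular expression used to strip the `x.` prefix are illustrative.

```r
set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))

# Variation on the question's function: name each statistic
descriptives <- function(arg) c(mean = mean(arg), median = median(arg), sd = sd(arg))
results <- aggregate(x ~ y, data = Df, descriptives)

# Flatten the matrix column into ordinary vector columns
flat <- do.call(data.frame, results)

# The flattened columns carry an "x." prefix; strip it if it is not wanted
names(flat) <- sub("^x\\.", "", names(flat))
str(flat)
```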

Upvotes: 2

Gregor Thomas

Reputation: 145775

In this case, with a matrix, you can use cbind:

cbind(results[1], results[[2]])
#   y         1         2         3
# 1 0 0.4710224 0.4280053 0.3206661
# 2 1 0.5769064 0.6220120 0.2683387

Using [ on the first argument keeps it a data.frame, so cbind.data.frame is used and the result is not coerced to a matrix. Using [[ on the second argument ensures that we bind the 2x3 matrix itself rather than a one-column data frame containing that matrix.
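The difference between the two extraction operators can be seen by checking what each one returns. A small sketch, rebuilding `results` from the question; the name `out` and the renaming step at the end are illustrative additions.

```r
set.seed(101)
Df <- data.frame(x = runif(100),
                 y = round(runif(100)))
descriptives <- function(arg) c(mean(arg), median(arg), sd(arg))
results <- aggregate(x ~ y, data = Df, descriptives)

is.data.frame(results[1])   # [ keeps the data frame wrapper
is.data.frame(results[2])   # also a data frame, still holding the matrix
is.matrix(results[[2]])     # [[ extracts the matrix itself

# Bind the data frame and the bare matrix, then name the new columns
out <- cbind(results[1], results[[2]])
names(out)[-1] <- c("mean", "median", "sd")
str(out)
```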

More generally, tidyr::unnest works well with vector or list columns, but the result is in long form: you would have to add an id column and spread it to get back to wide form.

However, the simplest way may be to use dplyr and summarize_at with a custom .funs argument:

library(dplyr)
# funs() has been deprecated since dplyr 0.8.0; a named list of functions works the same way here
descrip = list(m = mean, med = median, sd = sd)
Df %>% group_by(y) %>% summarize_at("x", .funs = descrip)
# # A tibble: 2 × 4
#       y         m       med        sd
#   <dbl>     <dbl>     <dbl>     <dbl>
# 1     0 0.4710224 0.4280053 0.3206661
# 2     1 0.5769064 0.6220120 0.2683387

Upvotes: 2

Imran Kocabiyik

Reputation: 439

Yes, it is relatively easy with dplyr:

library(dplyr)
results <- Df %>% group_by(y) %>% 
        summarize(mean = mean(x),
                  median = median(x),
                  sd = sd(x))

Upvotes: 1
