Scharron

Reputation: 17767

data.frame with a column containing a matrix in R

I'm trying to put some matrices in a dataframe in R, something like:

m <- matrix(c(1,2,3,4), nrow=2, ncol=2)
df <- data.frame(id=1, mat=m)

But when I do that, I get a dataframe with 2 rows and 3 columns instead of a dataframe with 1 row and 2 columns.

Reading the documentation, I have to escape my matrix using I().

df <- data.frame(id=1, mat=I(m))

str(df)
'data.frame':   2 obs. of  2 variables:
 $ id : num  1 1
 $ mat: AsIs [1:2, 1:2] 1 2 3 4

As I understand it, the dataframe contains one row for each row of the matrix, and the mat field is a list of matrix column values.

So how can I obtain a dataframe containing matrices?

Thanks!

Upvotes: 6

Views: 15877

Answers (6)

GKi

Reputation: 39667

To get a data.frame with 1 row and 2 columns for the given example you have to put the matrix inside a list.

m <- matrix(1:4, 2)

x <- list2DF(list(id=1, mat=list(m)))
x
#  id        mat
#1  1 1, 2, 3, 4

str(x)
#'data.frame':   1 obs. of  2 variables:
# $ id : num 1
# $ mat:List of 1
#  ..$ : int [1:2, 1:2] 1 2 3 4


y <- data.frame(id=1, mat=I(list(m)))
y
#  id        mat
#1  1 1, 2, 3, 4

str(y)
#'data.frame':   1 obs. of  2 variables:
# $ id : num 1
# $ mat:List of 1
#  ..$ : int [1:2, 1:2] 1 2 3 4
#  ..- attr(*, "class")= chr "AsIs"
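
The same I(list(...)) pattern should extend to several rows, with one matrix per row, and the matrices do not need to share dimensions. A minimal sketch with made-up matrices (m1, m2 and the id values are only for illustration):

```r
# Two matrices with different dimensions (illustrative values)
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(1:6, nrow = 3)

# I() keeps the list intact, so each row holds one whole matrix
df <- data.frame(id = 1:2, mat = I(list(m1, m2)))

dim(df)                                     # 2 rows, 2 columns
second <- df$mat[[2]]                       # the 3 x 2 matrix stored in row 2
n_rows <- vapply(df$mat, nrow, integer(1))  # number of rows of each stored matrix
```

Use [[ ]] rather than [ ] to pull a single matrix out of the list column.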

To create a data.frame with a matrix column from the given 2 x 2 data, the most straightforward route is to wrap the matrix in I() when creating the data.frame. An alternative that avoids the AsIs class is to insert the matrix afterwards, as other answers already show.

m <- matrix(1:4, 2)

x <- data.frame(id=1, mat=I(m))
str(x)
#'data.frame':   2 obs. of  2 variables:
# $ id : num  1 1
# $ mat: 'AsIs' int [1:2, 1:2] 1 2 3 4

y <- data.frame(id=rep(1, nrow(m)))
y[["m"]] <- m
#y["m"] <- m   #Alternative
#y[,"m"] <- m  #Alternative
#y$m <- m      #Alternative
str(y)
#'data.frame':   2 obs. of  2 variables:
# $ id: num  1 1
# $ m : int [1:2, 1:2] 1 2 3 4

z <- `[<-`(data.frame(id=rep(1, nrow(m))), , "mat", m)
str(z)
#'data.frame':   2 obs. of  2 variables:
# $ id : num  1 1
# $ mat: int [1:2, 1:2] 1 2 3 4

Alternatively the data can be stored in a list.

m <- matrix(1:4, 2)
x <- list(id=1, mat=m)
x
#$id
#[1] 1
#
#$mat
#     [,1] [,2]
#[1,]    1    3
#[2,]    2    4

str(x)
#List of 2
# $ id : num 1
# $ mat: int [1:2, 1:2] 1 2 3 4

Upvotes: 0

Jonathan Gellar

Reputation: 325

Data frames containing matrix columns do have their uses in specialized scenarios. These scenarios are cases when you have a whole vector of some variable for every observation in your data set. There are two cases that I have come across where this is common:

  1. Bayesian analysis: you create a posterior prediction for each observation, so for every "row" in your newdata, you have an entire vector (the length of that vector is the number of MCMC iterations).
  2. Functional data analysis: each "observation" is itself a function, and you store the observed realization of that function as a vector.

If you're working with data frames, there are a few obvious ways to handle this data, all of which are inefficient. I'll use the Bayesian case as an example:

  1. "Super-wide" format: you have one column for each element of the vectors, in addition to your other columns of the data frame. This makes an extremely wide data frame that is often hard to work with. It also makes it difficult to refer to only those columns that correspond to the posterior.
  2. "Super-long" (tidy) format: very memory intensive because all of the other columns of your data frame have to be repeated unnecessarily for every iteration of the posterior.
  3. List-columns: you can create a list where each element is the vector corresponding to the posterior for that row of the data frame. The problem here is that most of the manipulation you want to do will require you to unlist the posterior back to a matrix, and the listing/unlisting is unnecessary computation.
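
To make the list-column overhead concrete, here is a small sketch of the round trip. The draws are invented numbers standing in for posterior samples, and the column name post is just an assumption for the example:

```r
# Hypothetical posterior draws: 3 observations, 4 draws each
draws <- list(c(0.1, 0.2, 0.3, 0.4),
              c(1.1, 1.2, 1.3, 1.4),
              c(2.1, 2.2, 2.3, 2.4))
df <- data.frame(id = 1:3, post = I(draws))

# Most computations need the draws back in matrix form first
post_mat <- do.call(rbind, df$post)  # 3 x 4 matrix, one row per observation
row_means <- rowMeans(post_mat)      # posterior mean per observation
```

The do.call(rbind, ...) step is the extra conversion that a matrix column would avoid.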

Data frames with matrix columns are a very useful solution to this situation. The posterior stays in a matrix that has the same number of rows as the data frame, but the data frame treats that matrix as a single "column", and referring to it with df$mat returns the matrix. You can even use some dplyr functions, such as filtering, to return the corresponding rows of the matrix, but this is a bit experimental.
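
Base R row subsetting behaves the same way, as far as I can tell: the rows of the matrix column follow the rows of the data frame. A minimal sketch with toy values (not taken from any real posterior):

```r
df <- data.frame(id = 1:3)
df$mat <- matrix(1:6, nrow = 3)  # rows: (1,4), (2,5), (3,6)

sub <- df[df$id >= 2, ]          # keep observations 2 and 3
sub$mat                          # still a matrix, now with 2 rows
```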

The easiest method to create the matrix column is in two steps. First create the data frame without the matrix column, then add the matrix column with a simple assignment. I haven't found a one-step solution that doesn't involve I(), which changes the class of the column.

m <- matrix(c(1,2,3,4), nrow=2, ncol=2)
df <- data.frame(id = rep(1, nrow(m)))
df$mat <- m
names(df)
# [1] "id"  "mat"
str(df)
# 'data.frame': 2 obs. of  2 variables:
#  $ id : num  1 1
#  $ mat: num [1:2, 1:2] 1 2 3 4
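
If you do start from I(), one possible workaround is to strip the AsIs class afterwards with unclass(). This is still two steps, and it is only a sketch of one option:

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
df <- data.frame(id = rep(1, nrow(m)), mat = I(m))
inherits(df$mat, "AsIs")   # TRUE: I() added the class

df$mat <- unclass(df$mat)  # drop the class, keep the dim attribute
inherits(df$mat, "AsIs")   # FALSE: a plain numeric matrix again
```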

Upvotes: 4

zoc99

Reputation: 105

I came across the same problem while trying to understand the gasoline data in the pls package, and used $ for the job. First, let's create a matrix called spectra_mat, then a vector called response_var1.

spectra_mat = matrix(1:45, 9, 5)
response_var1 = 1:9

Now we put the vector response_var1 in a new data frame; let's call it df.

df = data.frame(response_var1)
df$spectra = spectra_mat

To check,

str(df)

'data.frame':   9 obs. of  2 variables:
 $ response_var1: int  1 2 3 4 5 6 7 8 9
 $ spectra      : int [1:9, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
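
Once the matrix is in place, it counts as one column of the data frame, and you index into it like any other matrix. A short sketch that rebuilds the same objects so it runs on its own:

```r
spectra_mat <- matrix(1:45, 9, 5)
df <- data.frame(response_var1 = 1:9)
df$spectra <- spectra_mat

ncol(df)                      # 2: the whole matrix counts as one column
dim(df$spectra)               # 9 5
second_col <- df$spectra[, 2] # second column of the matrix: 10 to 18
```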

Upvotes: 5

adamleerich

Reputation: 5865

A much easier way to do this is to define the data frame with a placeholder for the matrix:

m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2) 
df <- data.frame(id = 1, mat = rep(0, nrow(m)))

Then assign the matrix. There is no need to play with the class of a list or to use an *apply() function.

df$mat <- m
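
One thing to watch with this kind of assignment: the matrix needs the same number of rows as the data frame. A small sketch of what happens otherwise (too_tall is a made-up matrix, and tryCatch() is only there so the error does not stop the script):

```r
df <- data.frame(id = rep(1, 2))   # 2 rows
too_tall <- matrix(1:6, nrow = 3)  # 3 rows: does not fit

# Capture the error instead of halting
res <- tryCatch({ df$mat <- too_tall; "assigned" },
                error = function(e) "failed")
res                                # "failed"

df$mat <- matrix(1:4, nrow = 2)    # a 2-row matrix is accepted
```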

Upvotes: 5

Ben Bolker

Reputation: 226322

I find data.frames containing matrices mind-bendingly weird, but: the only way I know to achieve this is hidden in stats:::simulate.lm

Try this, poke through and see what's happening:

d <- data.frame(y=1:5,n=5)
g0 <- glm(cbind(y,n-y)~1,data=d,family=binomial)
debug(stats:::simulate.lm)
s <- simulate(g0,n=5)

This is the weird, back-door solution. Create a list, change its class to data.frame, and then (this is required) set the names and row.names manually (if you don't do those final steps the data will still be in the object, but it will print out as though it had zero rows ...)

m1 <- matrix(1:10,ncol=2)
m2 <- matrix(5:14,ncol=2)
dd <- list(m1,m2)
class(dd) <- "data.frame"
names(dd) <- LETTERS[1:2]
row.names(dd) <- 1:5
dd
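
To check what the back-door object contains, you can compare it with a data frame built by plain $ assignment. This is just a sanity-check sketch; dd2 and its placeholder column are my own additions:

```r
m1 <- matrix(1:10, ncol = 2)
m2 <- matrix(5:14, ncol = 2)

# Back-door construction, as above
dd <- list(m1, m2)
class(dd) <- "data.frame"
names(dd) <- LETTERS[1:2]
row.names(dd) <- 1:5

# Two-step construction for comparison (hypothetical placeholder column)
dd2 <- data.frame(A = rep(NA, 5))
dd2$A <- m1
dd2$B <- m2

dim(dd)                 # 5 rows, 2 columns
identical(dd$A, dd2$A)  # both columns hold the same matrix
```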

Upvotes: 7

chl

Reputation: 29367

The result you got (2 rows x 3 columns) is what R is expected to return, as it amounts to cbind-ing a vector (id, with recycling) and a matrix (m).
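
You can see the splitting directly in the dimensions and column names. A minimal sketch using the matrix from the question:

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
df <- data.frame(id = 1, mat = m)  # id is recycled, m is split by column

dim(df)    # 2 3
names(df)  # "id" "mat.1" "mat.2"
```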

IMO, it would be better to use a list or an array (when dimensions agree; no mix of numeric and factor values allowed) if you really want to bind different data structures. Otherwise, if both have the same number of rows, just cbind-ing your matrix to an existing data.frame will do the job. For example:

x1 <- replicate(2, rnorm(10))
x2 <- replicate(2, rnorm(10))
x12l <- list(x1=x1, x2=x2)
x12a <- array(c(x1, x2), dim=c(10,2,2))  # c() keeps x1 and x2 as separate slices

and the result reads

> str(x12l)
List of 2
 $ x1: num [1:10, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...
 $ x2: num [1:10, 1:2] -0.164 0.709 -0.268 -1.464 0.744 ...
> str(x12a)
 num [1:10, 1:2, 1:2] -0.326 0.552 -0.675 0.214 0.311 ...

Lists are easier to use if you plan to work with matrices of varying dimensions, and provided they are organized in the same way (for rows) as an external data.frame, you can subset them just as easily. Here is an example:

df1 <- data.frame(grp=gl(2, 5, labels=LETTERS[1:2]), 
                  age=sample(seq(25,35), 10, replace=TRUE))
with(df1, tapply(x12l$x1[,1], list(grp, age), mean))

You can also use lapply (for list) and apply (for array) functions.
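
As a rough sketch of both, here are column means computed per matrix. It uses fixed toy matrices instead of rnorm() so the numbers are reproducible, and it assumes the array is filled so that slice k holds matrix k (built here with c(x1, x2)):

```r
x1 <- matrix(1:20, nrow = 10)
x2 <- matrix(21:40, nrow = 10)
x12l <- list(x1 = x1, x2 = x2)
x12a <- array(c(x1, x2), dim = c(10, 2, 2))  # slice k holds matrix k

list_means <- lapply(x12l, colMeans)       # one vector of column means per list element
array_means <- apply(x12a, c(2, 3), mean)  # 2 x 2: rows = columns, cols = slices
```

With lapply you get a named list back, while apply returns a matrix whose columns correspond to the array slices.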

Upvotes: 1
