Reputation: 635
I want to create a data.frame with two columns, where each column itself contains multiple columns. (I need it as input for plsr in the pls package.)
It should look like the oliveoil data:
> oliveoil
chemical.Acidity chemical.Peroxide chemical.K232 chemical.K270 chemical.DK sensory.yellow sensory.green
G1 0.7300 12.7000 1.9000 0.1390 0.0030 21.4 73.4
G2 0.1900 12.3000 1.6780 0.1160 -0.0040 23.4 66.3
G3 0.2600 10.3000 1.6290 0.1160 -0.0050 32.7 53.5
G4 0.6700 13.7000 1.7010 0.1680 -0.0020 30.2 58.3
G5 0.5200 11.2000 1.5390 0.1190 -0.0010 51.8 32.5
I1 0.2600 18.7000 2.1170 0.1420 0.0010 40.7 42.9
I2 0.2400 15.3000 1.8910 0.1160 0.0000 53.8 30.4
I3 0.3000 18.5000 1.9080 0.1250 0.0010 26.4 66.5
I4 0.3500 15.6000 1.8240 0.1040 0.0000 65.7 12.1
I5 0.1900 19.4000 2.2220 0.1580 -0.0030 45.0 31.9
S1 0.1500 10.5000 1.5220 0.1160 -0.0040 70.9 12.2
S2 0.1600 8.1400 1.5270 0.1063 -0.0020 73.5 9.7
S3 0.2700 12.5000 1.5550 0.0930 -0.0020 68.1 12.0
S4 0.1600 11.0000 1.5730 0.0940 -0.0030 67.6 13.9
S5 0.2400 10.8000 1.3310 0.0850 -0.0030 71.4 10.6
S6 0.3000 11.4000 1.4150 0.0930 -0.0040 71.4 10.0
sensory.brown sensory.glossy sensory.transp sensory.syrup
G1 10.1 79.7 75.2 50.3
G2 9.8 77.8 68.7 51.7
G3 8.7 82.3 83.2 45.4
G4 12.2 81.1 77.1 47.8
G5 8.0 72.4 65.3 46.5
I1 20.1 67.7 63.5 52.2
I2 11.5 77.8 77.3 45.2
I3 14.2 78.7 74.6 51.8
I4 10.3 81.6 79.6 48.3
I5 28.4 75.7 72.9 52.8
S1 10.8 87.7 88.1 44.5
S2 8.3 89.9 89.7 42.3
S3 10.8 78.4 75.1 46.4
S4 11.9 84.6 83.8 48.5
S5 10.8 88.1 88.5 46.7
S6 11.4 89.5 88.5 47.2
And it is a data.frame with 2 columns:
> is.data.frame(oliveoil)
[1] TRUE
> dim(oliveoil)
[1] 16 2
I tried the following code:
x = data.frame(a = c(1,2,3), b = c(1,3,4))
y = data.frame(c = c(3,4,5), d = c(5,4,2))
d = data.frame(x = x, y = y)
It returns:
> d
x.a x.b y.c y.d
1 1 1 3 5
2 2 3 4 4
3 3 4 5 2
But I cannot access x with d$x:
> d$x
NULL
What I expect is:
> d$x
a b
1 1 1
2 2 3
3 3 4
I was expecting some argument of the data.frame function to make this work, something like:
d = data.frame(x = x, y = y, merge.columns = F)
But I cannot find any argument that does this in the docs.
Upvotes: 0
Views: 177
Reputation: 21264
The pls::plsr() function does not require data to be set up exactly like oliveoil. plsr() allows the response term to be a matrix, and oliveoil has a particular way of storing matrices, but you can supply any matrix to plsr().
For example, this fits a model without error:
library(pls)
n <- 100  # number of random values per matrix (10 rows x 10 columns)
y <- matrix(rnorm(n), nrow = 10)
x <- matrix(rnorm(n), nrow = 10)
plsr(y ~ x)
# Partial least squares regression , fitted with the kernel algorithm.
# Call:
# plsr(formula = y ~ x)
Also, consider the yarn dataset, which is also used in the pls docs; it just stores regular matrices in a data frame rather than using the I() approach that oliveoil uses.
For a bit more explanation:
The sub-components of oliveoil are not actually of class data.frame.
If you run str(oliveoil), you'll see the sensory and chemical objects in oliveoil are stored as AsIs objects. They're not technically data frame-classed objects, and in fact they were probably matrices with named rows and columns to begin with.
str(oliveoil)
'data.frame': 16 obs. of 2 variables:
$ chemical: 'AsIs' num [1:16, 1:5] 0.73 0.19 0.26 0.67 0.52 0.26 0.24 0.3 0.35 0.19 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr "G1" "G2" "G3" "G4" ...
.. ..$ : chr "Acidity" "Peroxide" "K232" "K270" ...
$ sensory : 'AsIs' num [1:16, 1:6] 21.4 23.4 32.7 30.2 51.8 40.7 53.8 26.4 65.7 45 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr "G1" "G2" "G3" "G4" ...
.. ..$ : chr "yellow" "green" "brown" "glossy" ...
The AsIs class means they were stored in oliveoil using the I() function (its help page, ?I, is titled "Inhibit Interpretation/Conversion of Objects"). I() protects an object from being converted into something else during an operation, like storage into a data frame.
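To see what that protection changes in practice, here is a minimal base R sketch. The matrix m and its column names p and q are made up for illustration; they are not part of the oliveoil data.

```r
# A small made-up matrix: 3 rows, 2 named columns
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("p", "q")))

# Without I(): data.frame() splits the matrix into one column per matrix column
plain <- data.frame(m = m)
names(plain)   # "m.p" "m.q"
ncol(plain)    # 2

# With I(): the matrix is kept whole as a single column of class "AsIs"
kept <- data.frame(m = I(m))
names(kept)               # "m"
ncol(kept)                # 1
inherits(kept$m, "AsIs")  # TRUE
dim(kept$m)               # 3 2
```

The first form is the flattening the question ran into; the second gives the one-column-per-matrix layout that str(oliveoil) shows.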
You can reproduce this with a simple example (although note that if you try to store two data frames in a data frame with I(), you'll get an error):
n <- 100
matrix_a <- matrix(rnorm(n), nrow = 10)
matrix_b <- matrix(rnorm(n), nrow = 10)
df <- data.frame(a = I(matrix_a), b = I(matrix_b))
str(df)
'data.frame': 10 obs. of 2 variables:
$ a: 'AsIs' num [1:10, 1:10] -0.817 -0.233 -1.987 0.523 -1.596 ...
$ b: 'AsIs' num [1:10, 1:10] 1.9189 -0.7043 0.0624 0.0152 -0.5409 ...
And df now contains matrix_a as $a and matrix_b as $b:
df$a
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.8167554 -0.61629222 0.3673423 1.30882012 0.97618868 -0.53124825
[2,] -0.2329451 0.08556506 -0.5839086 0.86298000 1.20452166 0.09825958
[3,] -1.9873738 -0.93537922 0.1057309 0.63585036 -1.09604531 1.33080572
[4,] 0.5227912 1.89505993 1.1184905 1.20683770 -0.02431886 -1.15878634
# ...
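You could also store matrix_a and matrix_b without I(), but not by passing them to data.frame() directly: data.frame(a = matrix_a, b = matrix_b) splits each matrix into separate columns (a.1, a.2, ...), so df2$a would be NULL. Assigning the matrices to an existing data frame keeps them whole:
# also works
df2 <- data.frame(foo = letters[1:10])
df2$a <- matrix_a
df2$b <- matrix_b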
TL;DR - plsr() takes any matrix, but if you want your data stored in a data frame, create a matrix and save it into a data frame, with or without I().
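Applied to the x and y from the question, that advice could look like the sketch below. It assumes all columns are numeric, so that as.matrix() produces numeric matrices; it uses base R only and does not call plsr().

```r
# The two data frames from the question
x <- data.frame(a = c(1, 2, 3), b = c(1, 3, 4))
y <- data.frame(c = c(3, 4, 5), d = c(5, 4, 2))

# Convert each data frame to a matrix and protect it with I(),
# so data.frame() stores it as a single column
d <- data.frame(x = I(as.matrix(x)), y = I(as.matrix(y)))

dim(d)         # 3 2
names(d)       # "x" "y"
colnames(d$x)  # "a" "b"
d$x[, "b"]     # the values 1, 3, 4
```

Here d$x is a matrix rather than a data frame, which matches how chemical and sensory are stored in oliveoil. With pls loaded, a call along the lines of plsr(y ~ x, data = d) would then mirror how oliveoil is used, though that call is not run here.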
Upvotes: 1