Reputation: 340
I'm doing some practice exercises on regression analysis in R. One of the questions asks me to perform some simple analysis using a linear regression function. To document the intermediate steps, I added the values to a preloaded data set, which is shown in the following screen grab:
The columns GPA and ACT_Score came preloaded with the data set. The following is the code I used to add the fitted_GPA column:
> GPA_lm = lm(formula = GPA ~ ACT_Score, data = ch_1_exer_119_GPA)
> fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]]*ch_1_exer_119_GPA[,2] # created vector of fitted values
> ch_1_exer_119_GPA$fitted_GPA = fitted_GPA # added column of fitted values to data frame
So now when I go to examine the type of my new column compared to one of the preloaded columns, I make the following observation:
> typeof(ch_1_exer_119_GPA$fitted_GPA) #added column to data frame
[1] "list"
> typeof(ch_1_exer_119_GPA$GPA) #preloaded column to data frame
[1] "double"
This came up when I was entering the name of one of the created columns for another calculation and noticed that the icon in front of the variable was not a "purple tag" like the variables that came loaded with the data set, but instead had the "data frame" icon.
This didn't have a direct effect on any of the simple calculations I did, but I can envision something like this presenting a problem in the future when I'm dealing with more complex scenarios. So I'd like to understand what I did to create this and how to rectify it.
Thank you in advance.
EDIT: As requested by r2evans, here is the following output:
> dput(head(ch_1_exer_119_GPA,15))
structure(list(GPA = c(3.897, 3.885, 3.778, 2.54, 3.028, 3.865,
2.962, 3.961, 0.5, 3.178, 3.31, 3.538, 3.083, 3.013, 3.245),
ACT_Score = c(21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24,
30, 24, 24, 33), fitted_GPA = structure(list(ACT_Score = c(2.92941895227791,
2.65762906394109, 3.20120884061472, 2.96824607918317, 2.92941895227791,
3.3176902213305, 3.35651734823576, 3.16238171370946, 3.24003596751998,
3.1235545868042, 3.04590033299369, 3.27886309442524, 3.04590033299369,
3.04590033299369, 3.39534447514102)), class = "data.frame", row.names = c(NA,
-15L)), residuals_GPA = structure(list(ACT_Score = c(0.967581047722093,
1.22737093605891, 0.576791159385276, -0.428246079183166,
0.0985810477220932, 0.547309778669498, -0.394517348235762,
0.798618286290536, -2.74003596751998, 0.0544454131957952,
0.264099667006314, 0.259136905574757, 0.0370996670063146,
-0.0329003329936857, -0.150344475141022)), class = "data.frame", row.names = c(NA,
-15L))), row.names = c(NA, -15L), class = c("tbl_df", "tbl",
"data.frame"))
Upvotes: 2
Views: 1100
Reputation: 7106
Tibbles allow the creation of list columns. As mentioned before, a data.frame is just a special kind of list. For example, to store an lm object inside of a tibble we could do this:
require(tidyverse)
#> Loading required package: tidyverse
library(magrittr) # provides %T>%, which attaching tidyverse does not make available
data.frame(values = list(lm(cyl ~ hwy, mpg))) %>%
`[`(1, 1, 1)
#> Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors): cannot coerce class '"lm"' to a data.frame
#but this will work
tibble(values = list(lm(cyl ~ hwy, mpg))) %T>%
print() %>%
`[`(1, 1, 1)
#> # A tibble: 1 x 1
#> values
#> <list>
#> 1 <lm>
#> [[1]]
#>
#> Call:
#> lm(formula = cyl ~ hwy, data = mpg)
#>
#> Coefficients:
#> (Intercept) hwy
#> 10.7223 -0.2062
Created on 2021-06-28 by the reprex package (v2.0.0)
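The statement that a data.frame is just a special kind of list can be checked directly in base R, and it also explains why typeof() reports "list" for a column that is itself a data frame. A minimal sketch, using a hypothetical two-row data frame named df rather than the real data set:

```r
# Hypothetical two-row data frame, for illustration only.
df <- data.frame(GPA = c(3.897, 3.885), ACT_Score = c(21, 14))

df_is_list  <- is.list(df)        # a data frame is a list of columns
df_type     <- typeof(df)         # the underlying storage type
n_elements  <- length(df)         # one list element per column
vector_type <- typeof(df$GPA)     # an ordinary numeric column
onecol_type <- typeof(df["GPA"])  # a one-column data frame is still a list

print(c(df_type, vector_type, onecol_type))
```

The contrast between typeof(df$GPA) and typeof(df["GPA"]) is the same one seen in the question between the preloaded column and the added one.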
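In this particular case we can use unnest() to expand the nested data frames into regular columns. Note that the result below has 225 rows rather than the original 15, because each original row is paired with all 15 nested values.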
require(tidyverse)
#> Loading required package: tidyverse
library(magrittr) # provides %T>%, which attaching tidyverse does not make available
glimpse(ch_1_exer_119_GPA)
#> Rows: 15
#> Columns: 4
#> $ GPA <dbl> 3.897, 3.885, 3.778, 2.540, 3.028, 3.865, 2.962, 3.961, …
#> $ ACT_Score <dbl> 21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24, 30, 24, 24, …
#> $ fitted_GPA <df[,1]> <data.frame[15 x 1]>
#> $ residuals_GPA <df[,1]> <data.frame[15 x 1]>
unnest(ch_1_exer_119_GPA, c(fitted_GPA, residuals_GPA)) %T>%
print() %>%
glimpse()
#> # A tibble: 225 x 4
#> GPA ACT_Score fitted_GPA residuals_GPA
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.90 21 2.93 0.968
#> 2 3.90 21 2.66 1.23
#> 3 3.90 21 3.20 0.577
#> 4 3.90 21 2.97 -0.428
#> 5 3.90 21 2.93 0.0986
#> 6 3.90 21 3.32 0.547
#> 7 3.90 21 3.36 -0.395
#> 8 3.90 21 3.16 0.799
#> 9 3.90 21 3.24 -2.74
#> 10 3.90 21 3.12 0.0544
#> # … with 215 more rows
#> Rows: 225
#> Columns: 4
#> $ GPA <dbl> 3.897, 3.897, 3.897, 3.897, 3.897, 3.897, 3.897, 3.897, …
#> $ ACT_Score <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, …
#> $ fitted_GPA <dbl> 2.929419, 2.657629, 3.201209, 2.968246, 2.929419, 3.3176…
#> $ residuals_GPA <dbl> 0.96758105, 1.22737094, 0.57679116, -0.42824608, 0.09858…
Created on 2021-06-28 by the reprex package (v2.0.0)
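Because the unnest() result above has 225 rows instead of 15, it may be simpler to pull the single inner vector out of each nested data frame. The following is a base R sketch under stated assumptions: gpa is a hypothetical three-row stand-in for ch_1_exer_119_GPA, and each nested data frame is assumed to hold exactly one column.

```r
# Hypothetical three-row stand-in for ch_1_exer_119_GPA.
gpa <- data.frame(GPA = c(3.897, 3.885, 3.778), ACT_Score = c(21, 14, 28))

# Recreate the problem: a column that is itself a one-column data frame.
gpa$fitted_GPA <- data.frame(ACT_Score = c(2.93, 2.66, 3.20))

# Find the columns that are data frames rather than plain vectors.
nested <- vapply(gpa, is.data.frame, logical(1))

# Replace each nested data frame with the single vector it contains
# (assumes every nested data frame has exactly one column).
for (nm in names(gpa)[nested]) {
  gpa[[nm]] <- gpa[[nm]][[1]]
}

str(gpa)
```

The row count stays the same, and the flattened column keeps its outer name (fitted_GPA) instead of the inner one (ACT_Score).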
Upvotes: 1
Reputation: 2374
Basically, what is happening is that in
fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]]*ch_1_exer_119_GPA[,2] # created vector of fitted values
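the resulting object fitted_GPA is a data frame with one variable and not a vector. This happens because ch_1_exer_119_GPA is a tibble, and [, 2] on a tibble returns a one-column tibble instead of dropping to a vector. Behind the scenes, "a data frame is a list of equal-length vectors", which is why typeof() reports "list" for the new column.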
If you replace the above line with
fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]] * ch_1_exer_119_GPA$ACT_Score # created vector of fitted values
you will get a vector instead of a data frame, so that when adding the new variable to your data frame with
ch_1_exer_119_GPA$fitted_GPA = fitted_GPA
it works as expected.
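The difference between the two lines can be reproduced in base R alone. In this sketch, drop = FALSE is used as a stand-in for the tibble behaviour (a one-column subset stays a data frame). The data frame df and the coefficient values are made up for illustration.

```r
# Hypothetical data and coefficients, for illustration only.
df <- data.frame(GPA = c(3.897, 3.885, 3.778), ACT_Score = c(21, 14, 28))
intercept <- 1
slope <- 0.5

# Keeping the data frame structure, as `[, 2]` does on a tibble.
fitted_df <- intercept + slope * df[, 2, drop = FALSE]

# Extracting the column as a plain vector.
fitted_vec <- intercept + slope * df$ACT_Score

df$fitted_df  <- fitted_df   # becomes a nested data frame column
df$fitted_vec <- fitted_vec  # becomes an ordinary numeric column

print(sapply(df, typeof))
```

Using $ or [[ ]] to extract the column gives a vector for both tibbles and plain data frames, so it is the safer habit when the class of the input is not known in advance.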
Here is the whole script. In the console output the columns look like <dbl> vectors, but glimpse() shows that the column is actually a data.frame.
library(dplyr)
ch_1_exer_119_GPA <- structure(list(GPA = c(3.897, 3.885, 3.778, 2.54, 3.028, 3.865, 2.962, 3.961, 0.5, 3.178, 3.31, 3.538, 3.083, 3.013, 3.245),
ACT_Score = c(21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24,30, 24, 24, 33),
fitted_GPA = structure(list(
ACT_Score = c(2.92941895227791, 2.65762906394109, 3.20120884061472, 2.96824607918317,
2.92941895227791, 3.3176902213305, 3.35651734823576, 3.16238171370946,
3.24003596751998, 3.1235545868042, 3.04590033299369, 3.27886309442524,
3.04590033299369, 3.04590033299369, 3.39534447514102)),
class = "data.frame", row.names = c(NA, -15L)),
residuals_GPA = structure(list(ACT_Score = c(0.967581047722093, 1.22737093605891, 0.576791159385276,
-0.428246079183166, 0.0985810477220932, 0.547309778669498,
-0.394517348235762, 0.798618286290536, -2.74003596751998,
0.0544454131957952, 0.264099667006314, 0.259136905574757,
0.0370996670063146, -0.0329003329936857, -0.150344475141022)),
class = "data.frame", row.names = c(NA,-15L))), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))
ch_1_exer_119_GPA
#> # A tibble: 15 x 4
#> GPA ACT_Score fitted_GPA$ACT_Score residuals_GPA$ACT_Score
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.90 21 2.93 0.968
#> 2 3.88 14 2.66 1.23
#> 3 3.78 28 3.20 0.577
#> 4 2.54 22 2.97 -0.428
#> 5 3.03 21 2.93 0.0986
#> 6 3.86 31 3.32 0.547
#> 7 2.96 32 3.36 -0.395
#> 8 3.96 27 3.16 0.799
#> 9 0.5 29 3.24 -2.74
#> 10 3.18 26 3.12 0.0544
#> 11 3.31 24 3.05 0.264
#> 12 3.54 30 3.28 0.259
#> 13 3.08 24 3.05 0.0371
#> 14 3.01 24 3.05 -0.0329
#> 15 3.24 33 3.40 -0.150
glimpse(ch_1_exer_119_GPA)
#> Rows: 15
#> Columns: 4
#> $ GPA <dbl> 3.897, 3.885, 3.778, 2.540, 3.028, 3.865, 2.962, 3.961, …
#> $ ACT_Score <dbl> 21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24, 30, 24, 24, …
#> $ fitted_GPA <df[,1]> <data.frame[15 x 1]>
#> $ residuals_GPA <df[,1]> <data.frame[15 x 1]>
GPA_lm = lm(formula = GPA ~ ACT_Score, data = ch_1_exer_119_GPA)
fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]]*ch_1_exer_119_GPA[,2] # [,2] on a tibble stays a one-column tibble, so this is not a vector
ch_1_exer_119_GPA$fitted_GPA = fitted_GPA
glimpse(ch_1_exer_119_GPA)
#> Rows: 15
#> Columns: 4
#> $ GPA <dbl> 3.897, 3.885, 3.778, 2.540, 3.028, 3.865, 2.962, 3.961, …
#> $ ACT_Score <dbl> 21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24, 30, 24, 24, …
#> $ fitted_GPA <df[,1]> <data.frame[15 x 1]>
#> $ residuals_GPA <df[,1]> <data.frame[15 x 1]>
ch_1_exer_119_GPA
#> # A tibble: 15 x 4
#> GPA ACT_Score fitted_GPA$ACT_Score residuals_GPA$ACT_Score
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.90 21 3.32 0.968
#> 2 3.88 14 3.53 1.23
#> 3 3.78 28 3.12 0.577
#> 4 2.54 22 3.29 -0.428
#> 5 3.03 21 3.32 0.0986
#> 6 3.86 31 3.03 0.547
#> 7 2.96 32 3.00 -0.395
#> 8 3.96 27 3.15 0.799
#> 9 0.5 29 3.09 -2.74
#> 10 3.18 26 3.18 0.0544
#> 11 3.31 24 3.24 0.264
#> 12 3.54 30 3.06 0.259
#> 13 3.08 24 3.24 0.0371
#> 14 3.01 24 3.24 -0.0329
#> 15 3.24 33 2.97 -0.150
Created on 2021-06-28 by the reprex package (v2.0.0)
Upvotes: 1
Reputation: 70643
This is your data:
> str(xy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 15 obs. of 4 variables:
$ GPA : num 3.9 3.88 3.78 2.54 3.03 ...
$ ACT_Score : num 21 14 28 22 21 31 32 27 29 26 ...
$ fitted_GPA :'data.frame': 15 obs. of 1 variable:
..$ ACT_Score: num 2.93 2.66 3.2 2.97 2.93 ...
$ residuals_GPA:'data.frame': 15 obs. of 1 variable:
..$ ACT_Score: num 0.9676 1.2274 0.5768 -0.4282 0.0986 ...
Notice that fitted_GPA is a data.frame added inside a data.frame. This works because a data.frame is just a special list and, as you know, you can have a list of lists... Anyway, when I run
GPA_lm <- lm(formula = GPA ~ ACT_Score, data = xy)
fitted_GPA <- coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]] * xy[, 2]
xy$fitted_GPA <- fitted_GPA
I get a nice clean result:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 15 obs. of 4 variables:
$ GPA : num 3.9 3.88 3.78 2.54 3.03 ...
$ ACT_Score : num 21 14 28 22 21 31 32 27 29 26 ...
$ fitted_GPA : num 3.32 3.53 3.12 3.29 3.32 ...
$ residuals_GPA:'data.frame': 15 obs. of 1 variable:
..$ ACT_Score: num 0.9676 1.2274 0.5768 -0.4282 0.0986 ...
> typeof(xy$fitted_GPA)
[1] "double"
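As an alternative to computing the fitted values by hand, base R's fitted() and residuals() return plain numeric vectors, which sidesteps the subsetting question entirely. A minimal sketch, assuming a hypothetical five-row data frame gpa built from the first rows of the posted data:

```r
# Hypothetical five-row subset of the posted data, for illustration only.
gpa <- data.frame(
  GPA       = c(3.897, 3.885, 3.778, 2.54, 3.028),
  ACT_Score = c(21, 14, 28, 22, 21)
)

GPA_lm <- lm(GPA ~ ACT_Score, data = gpa)

# fitted() and residuals() return named numeric vectors; unname() drops the names.
gpa$fitted_GPA    <- unname(fitted(GPA_lm))
gpa$residuals_GPA <- unname(residuals(GPA_lm))

# The manual formula from the question, using $ so the result is a vector.
manual <- coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]] * gpa$ACT_Score

str(gpa)
```

Both new columns are ordinary numeric vectors, and the fitted values agree with the manual intercept-plus-slope calculation.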
Upvotes: 1