D.C. the III

Reputation: 340

Why is the column I added to my data frame in R of type "list" instead of an atomic data type, and how do I fix it?

I'm doing some practice exercises on regression analysis in R. One of the questions asks me to perform a simple analysis using a linear regression function. To document the intermediate steps, I added the values to a preloaded data set, shown in this screen grab:

[screen grab of the data frame, image not available]

The columns GPA and ACT_Score came preloaded with the data set. The following is the code I used to add the fitted_GPA column:

> GPA_lm = lm(formula = GPA ~ ACT_Score, data = ch_1_exer_119_GPA)
> fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]]*ch_1_exer_119_GPA[,2]   # created vector of fitted values
> ch_1_exer_119_GPA$fitted_GPA = fitted_GPA       # added column of fitted values to data frame

Now, when I examine the type of my new column compared to one of the preloaded columns, I observe the following:

> typeof(ch_1_exer_119_GPA$fitted_GPA)  #added column to data frame
[1] "list"
> typeof(ch_1_exer_119_GPA$GPA)  #preloaded column to data frame
[1] "double"

This came up when I was entering the name of one of the created columns for another calculation and noticed that the icon in front of the variable was not a "purple tag" like the variables that came loaded with the data set, but instead had the "data frame" icon.

This didn't have a direct effect on any of the simple calculations I did, but I can envision something like this presenting a problem in the future when I'm dealing with more complex scenarios. So I'd like to get an understanding as to what it is that I did to create this and how to rectify it?

Thank you in advance.

EDIT: As requested by r2evans, here is the output:

> dput(head(ch_1_exer_119_GPA,15))
structure(list(GPA = c(3.897, 3.885, 3.778, 2.54, 3.028, 3.865, 
2.962, 3.961, 0.5, 3.178, 3.31, 3.538, 3.083, 3.013, 3.245), 
    ACT_Score = c(21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24, 
    30, 24, 24, 33), fitted_GPA = structure(list(ACT_Score = c(2.92941895227791, 
    2.65762906394109, 3.20120884061472, 2.96824607918317, 2.92941895227791, 
    3.3176902213305, 3.35651734823576, 3.16238171370946, 3.24003596751998, 
    3.1235545868042, 3.04590033299369, 3.27886309442524, 3.04590033299369, 
    3.04590033299369, 3.39534447514102)), class = "data.frame", row.names = c(NA, 
    -15L)), residuals_GPA = structure(list(ACT_Score = c(0.967581047722093, 
    1.22737093605891, 0.576791159385276, -0.428246079183166, 
    0.0985810477220932, 0.547309778669498, -0.394517348235762, 
    0.798618286290536, -2.74003596751998, 0.0544454131957952, 
    0.264099667006314, 0.259136905574757, 0.0370996670063146, 
    -0.0329003329936857, -0.150344475141022)), class = "data.frame", row.names = c(NA, 
    -15L))), row.names = c(NA, -15L), class = c("tbl_df", "tbl", 
"data.frame"))

Upvotes: 2

Views: 1100

Answers (3)

jpdugo17

Reputation: 7106

Tibbles allow the creation of list columns. As mentioned in the other answers, a data.frame is just a special kind of list. For example, trying to store an lm object in a column with data.frame() fails, while the same call with tibble() works:

require(tidyverse)
#> Loading required package: tidyverse

data.frame(values = list(lm(cyl ~ hwy, mpg))) %>%
  `[`(1, 1, 1)
#> Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors): cannot coerce class '"lm"' to a data.frame

#but this will work 

tibble(values = list(lm(cyl ~ hwy, mpg))) %T>%
  print() %>% 
  `[`(1, 1, 1)
#> # A tibble: 1 x 1
#>   values
#>   <list>
#> 1 <lm>
#> [[1]]
#> 
#> Call:
#> lm(formula = cyl ~ hwy, data = mpg)
#> 
#> Coefficients:
#> (Intercept)          hwy  
#>     10.7223      -0.2062

Created on 2021-06-28 by the reprex package (v2.0.0)

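In this particular case we can use unnest() to expand the data frame columns into ordinary columns. Note that the result shown below has 225 rows (15 x 15) rather than the original 15, so check the row count before relying on it.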

require(tidyverse)
#> Loading required package: tidyverse

glimpse(ch_1_exer_119_GPA)
#> Rows: 15
#> Columns: 4
#> $ GPA           <dbl> 3.897, 3.885, 3.778, 2.540, 3.028, 3.865, 2.962, 3.961, …
#> $ ACT_Score     <dbl> 21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24, 30, 24, 24, …
#> $ fitted_GPA    <df[,1]> <data.frame[15 x 1]>
#> $ residuals_GPA <df[,1]> <data.frame[15 x 1]>

unnest(ch_1_exer_119_GPA, c(fitted_GPA, residuals_GPA)) %T>%
  print() %>% 
  glimpse()
#> # A tibble: 225 x 4
#>      GPA ACT_Score fitted_GPA residuals_GPA
#>    <dbl>     <dbl>      <dbl>         <dbl>
#>  1  3.90        21       2.93        0.968 
#>  2  3.90        21       2.66        1.23  
#>  3  3.90        21       3.20        0.577 
#>  4  3.90        21       2.97       -0.428 
#>  5  3.90        21       2.93        0.0986
#>  6  3.90        21       3.32        0.547 
#>  7  3.90        21       3.36       -0.395 
#>  8  3.90        21       3.16        0.799 
#>  9  3.90        21       3.24       -2.74  
#> 10  3.90        21       3.12        0.0544
#> # … with 215 more rows
#> Rows: 225
#> Columns: 4
#> $ GPA           <dbl> 3.897, 3.897, 3.897, 3.897, 3.897, 3.897, 3.897, 3.897, …
#> $ ACT_Score     <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, …
#> $ fitted_GPA    <dbl> 2.929419, 2.657629, 3.201209, 2.968246, 2.929419, 3.3176…
#> $ residuals_GPA <dbl> 0.96758105, 1.22737094, 0.57679116, -0.42824608, 0.09858…

Created on 2021-06-28 by the reprex package (v2.0.0)
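As an alternative sketch, tidyr also provides unpack(), which is meant for data frame columns and keeps the row count unchanged. The example below uses a small made-up tibble (the name `nested` and all its values are hypothetical) and assumes tidyr 1.0.0 or later. Because both inner columns are called ACT_Score, names_sep is used to prefix them with the outer column name.

```r
library(tibble)
library(tidyr)

# Hypothetical toy tibble with two data frame columns, mimicking the question's structure
nested <- tibble(
  GPA       = c(3.9, 3.1, 2.5),
  ACT_Score = c(28, 24, 18)
)
nested$fitted_GPA    <- data.frame(ACT_Score = c(3.7, 3.2, 2.6))
nested$residuals_GPA <- data.frame(ACT_Score = c(0.2, -0.1, -0.1))

# unpack() spreads each data frame column into ordinary columns;
# names_sep pastes the outer and inner names together so they do not collide
flat <- unpack(nested, c(fitted_GPA, residuals_GPA), names_sep = "_")

names(flat)  # inspect the new column names
nrow(flat)   # same number of rows as `nested`
```

With this approach the new columns end up named after both the outer and the inner column, so you may want to rename them afterwards.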

Upvotes: 1

Marcelo Avila

Reputation: 2374

Basically, what is happening is that in

fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]]*ch_1_exer_119_GPA[,2]   # created vector of fitted values

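the resulting object fitted_GPA is a data frame with one variable and not a vector. Your data set is a tibble (its class includes "tbl_df"), and indexing a tibble with [, 2] returns a one-column tibble instead of dropping to a vector. Behind the scenes, "a data frame is a list of equal-length vectors", which is why typeof() reports "list".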

If you replace the above line with

fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]] * ch_1_exer_119_GPA$ACT_Score   # created vector of fitted values

you will get a vector instead of a data frame, so that when adding the new variable to your data frame with

ch_1_exer_119_GPA$fitted_GPA = fitted_GPA

it works as expected.
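A minimal sketch of the difference, using a small made-up tibble (the names `toy` and `toy_lm` and all values are hypothetical, and the tibble package is assumed to be installed). It also shows fitted() and residuals() from the stats package, which return the fitted values and residuals as plain numeric vectors and so avoid the manual arithmetic altogether.

```r
library(tibble)

# Hypothetical toy data standing in for ch_1_exer_119_GPA
toy <- tibble(
  GPA       = c(3.9, 3.1, 2.5, 3.6, 2.9),
  ACT_Score = c(28, 24, 18, 30, 21)
)

# Single-bracket column indexing on a tibble returns a one-column tibble
by_bracket <- toy[, 2]
is.data.frame(by_bracket)

# `$` and `[[` return the underlying numeric vector
by_dollar <- toy$ACT_Score
by_double <- toy[[2]]
is.numeric(by_dollar)

toy_lm <- lm(GPA ~ ACT_Score, data = toy)

# Manual fitted values computed from a plain vector are a plain vector
manual_fit <- coef(toy_lm)[[1]] + coef(toy_lm)[[2]] * toy$ACT_Score

# fitted() and residuals() give the same information without the arithmetic;
# unname() drops the observation names they carry
toy$fitted_GPA    <- unname(fitted(toy_lm))
toy$residuals_GPA <- unname(residuals(toy_lm))

typeof(toy$fitted_GPA)
```

Using `toy[[2]]` or `toy[["ACT_Score"]]` is a reasonable choice when you want to keep indexing by position or by a name stored in a variable.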

Edit: Full Reprex

Here is the whole script. The console output makes it look as if the new columns are <dbl> vectors, but glimpse() shows that each of them is actually a data.frame.

library(dplyr)

ch_1_exer_119_GPA <- structure(list(GPA = c(3.897, 3.885, 3.778, 2.54, 3.028, 3.865,  2.962, 3.961, 0.5, 3.178, 3.31, 3.538, 3.083, 3.013, 3.245), 
                                    ACT_Score = c(21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24,30, 24, 24, 33),
                                    fitted_GPA = structure(list(
                                      ACT_Score = c(2.92941895227791, 2.65762906394109, 3.20120884061472, 2.96824607918317,
                                                    2.92941895227791, 3.3176902213305, 3.35651734823576, 3.16238171370946,
                                                    3.24003596751998, 3.1235545868042, 3.04590033299369, 3.27886309442524,
                                                    3.04590033299369, 3.04590033299369, 3.39534447514102)), 
                                      class = "data.frame", row.names = c(NA, -15L)),
                                    residuals_GPA = structure(list(ACT_Score = c(0.967581047722093, 1.22737093605891, 0.576791159385276,
                                                                                 -0.428246079183166, 0.0985810477220932, 0.547309778669498,
                                                                                 -0.394517348235762, 0.798618286290536, -2.74003596751998,
                                                                                 0.0544454131957952, 0.264099667006314, 0.259136905574757,
                                                                                 0.0370996670063146,  -0.0329003329936857, -0.150344475141022)),
                                                              class = "data.frame", row.names = c(NA,-15L))), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))
ch_1_exer_119_GPA
#> # A tibble: 15 x 4
#>      GPA ACT_Score fitted_GPA$ACT_Score residuals_GPA$ACT_Score
#>    <dbl>     <dbl>                <dbl>                   <dbl>
#>  1  3.90        21                 2.93                  0.968 
#>  2  3.88        14                 2.66                  1.23  
#>  3  3.78        28                 3.20                  0.577 
#>  4  2.54        22                 2.97                 -0.428 
#>  5  3.03        21                 2.93                  0.0986
#>  6  3.86        31                 3.32                  0.547 
#>  7  2.96        32                 3.36                 -0.395 
#>  8  3.96        27                 3.16                  0.799 
#>  9  0.5         29                 3.24                 -2.74  
#> 10  3.18        26                 3.12                  0.0544
#> 11  3.31        24                 3.05                  0.264 
#> 12  3.54        30                 3.28                  0.259 
#> 13  3.08        24                 3.05                  0.0371
#> 14  3.01        24                 3.05                 -0.0329
#> 15  3.24        33                 3.40                 -0.150

glimpse(ch_1_exer_119_GPA)
#> Rows: 15
#> Columns: 4
#> $ GPA           <dbl> 3.897, 3.885, 3.778, 2.540, 3.028, 3.865, 2.962, 3.961, …
#> $ ACT_Score     <dbl> 21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24, 30, 24, 24, …
#> $ fitted_GPA    <df[,1]> <data.frame[15 x 1]>
#> $ residuals_GPA <df[,1]> <data.frame[15 x 1]>

GPA_lm = lm(formula = GPA ~ ACT_Score, data = ch_1_exer_119_GPA)
fitted_GPA = coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]]*ch_1_exer_119_GPA[,2]   # created vector of fitted values
ch_1_exer_119_GPA$fitted_GPA = fitted_GPA
glimpse(ch_1_exer_119_GPA)
#> Rows: 15
#> Columns: 4
#> $ GPA           <dbl> 3.897, 3.885, 3.778, 2.540, 3.028, 3.865, 2.962, 3.961, …
#> $ ACT_Score     <dbl> 21, 14, 28, 22, 21, 31, 32, 27, 29, 26, 24, 30, 24, 24, …
#> $ fitted_GPA    <df[,1]> <data.frame[15 x 1]>
#> $ residuals_GPA <df[,1]> <data.frame[15 x 1]>
ch_1_exer_119_GPA
#> # A tibble: 15 x 4
#>      GPA ACT_Score fitted_GPA$ACT_Score residuals_GPA$ACT_Score
#>    <dbl>     <dbl>                <dbl>                   <dbl>
#>  1  3.90        21                 3.32                  0.968 
#>  2  3.88        14                 3.53                  1.23  
#>  3  3.78        28                 3.12                  0.577 
#>  4  2.54        22                 3.29                 -0.428 
#>  5  3.03        21                 3.32                  0.0986
#>  6  3.86        31                 3.03                  0.547 
#>  7  2.96        32                 3.00                 -0.395 
#>  8  3.96        27                 3.15                  0.799 
#>  9  0.5         29                 3.09                 -2.74  
#> 10  3.18        26                 3.18                  0.0544
#> 11  3.31        24                 3.24                  0.264 
#> 12  3.54        30                 3.06                  0.259 
#> 13  3.08        24                 3.24                  0.0371
#> 14  3.01        24                 3.24                 -0.0329
#> 15  3.24        33                 2.97                 -0.150

Created on 2021-06-28 by the reprex package (v2.0.0)

Upvotes: 1

Roman Luštrik

Reputation: 70643

This is your data

> str(xy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   15 obs. of  4 variables:
 $ GPA          : num  3.9 3.88 3.78 2.54 3.03 ...
 $ ACT_Score    : num  21 14 28 22 21 31 32 27 29 26 ...
 $ fitted_GPA   :'data.frame':  15 obs. of  1 variable:
  ..$ ACT_Score: num  2.93 2.66 3.2 2.97 2.93 ...
 $ residuals_GPA:'data.frame':  15 obs. of  1 variable:
  ..$ ACT_Score: num  0.9676 1.2274 0.5768 -0.4282 0.0986 ...

Notice that fitted_GPA is a data.frame added inside a data.frame. This works because a data.frame is just a special list and as you know, you can have a list of lists... Anyway, when I run

GPA_lm <- lm(formula = GPA ~ ACT_Score, data = xy)
fitted_GPA <- coef(GPA_lm)[[1]] + coef(GPA_lm)[[2]] * xy[, 2] 

xy$fitted_GPA <- fitted_GPA

I get a nice clean result

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   15 obs. of  4 variables:
 $ GPA          : num  3.9 3.88 3.78 2.54 3.03 ...
 $ ACT_Score    : num  21 14 28 22 21 31 32 27 29 26 ...
 $ fitted_GPA   : num  3.32 3.53 3.12 3.29 3.32 ...
 $ residuals_GPA:'data.frame':  15 obs. of  1 variable:
  ..$ ACT_Score: num  0.9676 1.2274 0.5768 -0.4282 0.0986 ...

> typeof(xy$fitted_GPA)
[1] "double"
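If the nested columns are already in your data, one possible way to find and flatten them in base R is sketched below. The data frame `xy_toy` and its values are made up for illustration, and the loop only handles data frame columns that wrap exactly one inner column, as in the question.

```r
# Hypothetical plain data frame with one data frame column
xy_toy <- data.frame(GPA = c(3.9, 3.1, 2.5), ACT_Score = c(28, 24, 18))
xy_toy$fitted_GPA <- data.frame(ACT_Score = c(3.7, 3.2, 2.6))

# Flag the columns that are themselves data frames
is_df_col <- vapply(xy_toy, is.data.frame, logical(1))

# Replace each one-column data frame column with the vector it wraps
for (nm in names(xy_toy)[is_df_col]) {
  if (ncol(xy_toy[[nm]]) == 1) {
    xy_toy[[nm]] <- xy_toy[[nm]][[1]]
  }
}

typeof(xy_toy$fitted_GPA)
```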

Upvotes: 1
