jlakes85
Reputation: 57

R Not Properly Summarizing Qualitative Data

I'm having trouble getting a proper summary of a qualitative data column in both the RGui and RStudio environments. The data in question is the "Auto" data from "An Introduction to Statistical Learning, with Applications in R" (www.StatLearning.com). The issue with the "name" column is present whether I use the "Auto.csv" or the "Auto.data" file from the book's website. Interestingly, RGui correctly characterizes the "horsepower" column, but RStudio does not. Again, neither correctly characterizes the "name" column. Any help correcting this would be greatly appreciated.

Upvotes: 0

Views: 425

Answers (1)

Marcelo Avila
Reputation: 2374

There are two unrelated issues here. One is that horsepower has missing values encoded as "?". read.csv() therefore reads horsepower as a character vector rather than a numeric one. The argument na.strings = "?" fixes this.
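As a minimal sketch of the first issue, the snippet below uses a made-up three-row sample (the `csv_text` string is hypothetical, not the real Auto file) so it runs without a network connection. It compares the column type with and without `na.strings = "?"`.

```r
# Hypothetical inline sample standing in for Auto.csv (not the real data)
csv_text <- "mpg,horsepower\n18,130\n25,?\n22,95"

# Without na.strings, the "?" entry prevents numeric parsing
# (character in R >= 4.0.0, factor in earlier versions)
raw <- read.csv(text = csv_text)
is.numeric(raw$horsepower)

# With na.strings = "?", the "?" becomes NA and the column is numeric
fixed <- read.csv(text = csv_text, na.strings = "?")
is.numeric(fixed$horsepower)
sum(is.na(fixed$horsepower))
```

The same argument should work unchanged when reading the actual file by path or URL.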

The other issue is that since version 4.0.0,

R now uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table().

As a result, scripts that omit the stringsAsFactors argument give different results depending on the R version. Before version 4.0.0, strings are converted to factors automatically; since 4.0.0, they are read as character by default. If you wish to convert to factors, set stringsAsFactors = TRUE (or convert later with as.factor()).
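For the "convert later" route, a small sketch on a hypothetical inline sample (the `csv_text` string and its rows are invented for illustration) might look like this. Setting stringsAsFactors explicitly keeps the behavior the same across R versions.

```r
# Hypothetical inline sample (not the real Auto data)
csv_text <- "mpg,name\n18,ford pinto\n25,amc hornet\n22,ford pinto"

# Read strings as character, regardless of the R version's default
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert just the column of interest to a factor afterwards
df$name <- as.factor(df$name)

# summary() now reports counts per level instead of Length/Class/Mode
summary(df$name)
```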

The reasoning behind the change is explained in depth here. The most compelling reason, in my opinion, is the reproducibility problem caused by automatic conversion to factors.

When creating a factor from a character vector, if the levels are not given explicitly the sorted unique values are used for the levels, and of course the result of sorting is locale-dependent

So, if you wish to convert to factors and be sure the same script produces the same results regardless of your locale (i.e., language settings), it is advisable to convert to factors manually and set the levels explicitly.
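A rough illustration of setting levels explicitly, using made-up values (the `origin_chr` vector and the chosen level order are assumptions for the example, not part of the Auto data):

```r
# Hypothetical character values
origin_chr <- c("japan", "usa", "europe", "usa")

# Levels inferred from the sorted unique values; the order can depend on locale
f_auto <- factor(origin_chr)

# Levels given explicitly; the order is fixed by the script, not by the locale
f_fixed <- factor(origin_chr, levels = c("usa", "europe", "japan"))
levels(f_fixed)
as.integer(f_fixed)  # 3 1 2 1
```

With explicit levels, the underlying integer codes stay the same on every machine, which matters whenever the level order feeds into a model or a plot.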

Example with and without the stringsAsFactors argument

url <- "https://www.statlearning.com/s/Auto.csv"
df_factor <- read.csv(url, stringsAsFactors = TRUE, na.strings = "?")
summary(df_factor)
#>       mpg          cylinders      displacement     horsepower        weight    
#>  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
#>  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   1st Qu.: 75.0   1st Qu.:2223  
#>  Median :23.00   Median :4.000   Median :146.0   Median : 93.5   Median :2800  
#>  Mean   :23.52   Mean   :5.458   Mean   :193.5   Mean   :104.5   Mean   :2970  
#>  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:126.0   3rd Qu.:3609  
#>  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
#>                                                  NA's   :5                     
#>   acceleration        year           origin                  name    
#>  Min.   : 8.00   Min.   :70.00   Min.   :1.000   ford pinto    :  6  
#>  1st Qu.:13.80   1st Qu.:73.00   1st Qu.:1.000   amc matador   :  5  
#>  Median :15.50   Median :76.00   Median :1.000   ford maverick :  5  
#>  Mean   :15.56   Mean   :75.99   Mean   :1.574   toyota corolla:  5  
#>  3rd Qu.:17.10   3rd Qu.:79.00   3rd Qu.:2.000   amc gremlin   :  4  
#>  Max.   :24.80   Max.   :82.00   Max.   :3.000   amc hornet    :  4  
#>                                                  (Other)       :368

df_string <- read.csv(url, stringsAsFactors = FALSE, na.strings = "?")
summary(df_string)
#>       mpg          cylinders      displacement     horsepower        weight    
#>  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
#>  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   1st Qu.: 75.0   1st Qu.:2223  
#>  Median :23.00   Median :4.000   Median :146.0   Median : 93.5   Median :2800  
#>  Mean   :23.52   Mean   :5.458   Mean   :193.5   Mean   :104.5   Mean   :2970  
#>  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:126.0   3rd Qu.:3609  
#>  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
#>                                                  NA's   :5                     
#>   acceleration        year           origin          name          
#>  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:397        
#>  1st Qu.:13.80   1st Qu.:73.00   1st Qu.:1.000   Class :character  
#>  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
#>  Mean   :15.56   Mean   :75.99   Mean   :1.574                     
#>  3rd Qu.:17.10   3rd Qu.:79.00   3rd Qu.:2.000                     
#>  Max.   :24.80   Max.   :82.00   Max.   :3.000                     
#> 

Created on 2021-03-21 by the reprex package (v1.0.0)

Upvotes: 1
