Reputation: 57
I'm having trouble getting a proper summary of a qualitative data column in both the RGui and RStudio environments. The data in question is the "Auto" data from "An Introduction to Statistical Learning, with Applications in R" (www.StatLearning.com). The issue with the "name" column is present whether I use the "Auto.csv" or the "Auto.data" file from the book's website. What's interesting is that RGui correctly characterizes the "horsepower" column, but RStudio does not. Again, neither correctly characterizes the "name" column. Any help correcting this would be greatly appreciated.
Upvotes: 0
Views: 425
Reputation: 2374
There are two unrelated issues here. One is that horsepower has missing values encoded as "?". read.csv() therefore reads horsepower as a character vector rather than a numeric one. The argument na.strings = "?" will fix this.
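To see this first issue in isolation, here is a minimal sketch that uses a small inline CSV instead of the real file. The three rows are made up for illustration; only the "?" placeholder mirrors what Auto.csv does.

```r
# Hypothetical three-row sample; only the "?" placeholder mirrors Auto.csv.
csv_text <- "mpg,horsepower\n18,130\n25,?\n21,95"

# Without na.strings, the "?" keeps the column from being parsed as numbers
# (it comes back as character, or as a factor before R 4.0.0).
raw <- read.csv(text = csv_text)
is.numeric(raw$horsepower)

# Declaring "?" as a missing-value marker lets R parse the column as numbers.
fixed <- read.csv(text = csv_text, na.strings = "?")
is.numeric(fixed$horsepower)
sum(is.na(fixed$horsepower))
```

Once the column is numeric, summary() reports quantiles and an NA count for it instead of just its length and class.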
The other issue is that since version 4.0.0, R uses stringsAsFactors = FALSE as the default, and hence no longer converts strings to factors in calls to data.frame() and read.table(). As a result, scripts that omit the stringsAsFactors argument give different results depending on the R version: before 4.0.0, strings are converted to factors automatically, and since 4.0.0 they are read as character by default. If you wish to convert to factors, just set stringsAsFactors = TRUE (or convert to factors later on with as.factor()).
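As a rough sketch of the "convert later" route, the snippet below reads a made-up inline sample (not the real Auto data) as character and then converts only the column that should be categorical. Passing stringsAsFactors explicitly is an assumption about what you want; it simply makes the script behave the same on R versions before and after 4.0.0.

```r
# Hypothetical inline sample standing in for the real file.
csv_text <- "mpg,name\n18,ford pinto\n25,amc gremlin\n21,ford pinto"

# State stringsAsFactors explicitly so the result does not depend on the R version.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(df$name)

# Convert only the column that should be treated as categorical.
df$name <- as.factor(df$name)
nlevels(df$name)
summary(df$name)
```

After the conversion, summary() on the column shows counts per level rather than Length/Class/Mode.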
The reasoning behind the change is explained in depth here. The most compelling reason, in my opinion, is the reproducibility problem that arises when strings are automatically converted to factors.
When creating a factor from a character vector, if the levels are not given explicitly the sorted unique values are used for the levels, and of course the result of sorting is locale-dependent.
So, if you wish to convert to factors and be sure the same script will produce the same results regardless of your locale (i.e., language settings), it is advisable to convert to factors manually and set the levels explicitly.
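A minimal sketch of setting levels explicitly is shown below. The car names and the chosen level order are illustrative assumptions, not taken from the Auto data as-is; the mixed capitalization is there only because case is a common source of locale-dependent sort order.

```r
# Hypothetical sample of car names; the values are illustrative only.
name <- c("ford pinto", "amc matador", "Ford Maverick", "amc matador")

# Implicit levels: sort(unique(name)), whose order depends on the locale's collation.
f_implicit <- factor(name)

# Explicit levels: the order is fixed by the script, not by the locale.
lvls <- c("amc matador", "ford pinto", "Ford Maverick")
f_explicit <- factor(name, levels = lvls)

levels(f_explicit)
as.integer(f_explicit)
```

With explicit levels, the integer codes behind the factor (and anything that depends on level order, such as the reference level in a model) stay the same on every machine. The order of levels(f_implicit), by contrast, may differ between, say, the C locale and an English locale.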
url <- "https://www.statlearning.com/s/Auto.csv"
df_factor <- read.csv(url, stringsAsFactors = TRUE, na.strings = "?")
summary(df_factor)
#>      mpg         cylinders    displacement    horsepower       weight
#> Min.   : 9.00  Min.   :3.000  Min.   : 68.0  Min.   : 46.0  Min.   :1613
#> 1st Qu.:17.50  1st Qu.:4.000  1st Qu.:104.0  1st Qu.: 75.0  1st Qu.:2223
#> Median :23.00  Median :4.000  Median :146.0  Median : 93.5  Median :2800
#> Mean   :23.52  Mean   :5.458  Mean   :193.5  Mean   :104.5  Mean   :2970
#> 3rd Qu.:29.00  3rd Qu.:8.000  3rd Qu.:262.0  3rd Qu.:126.0  3rd Qu.:3609
#> Max.   :46.60  Max.   :8.000  Max.   :455.0  Max.   :230.0  Max.   :5140
#>                                              NA's   :5
#> acceleration       year          origin             name
#> Min.   : 8.00  Min.   :70.00  Min.   :1.000  ford pinto    :  6
#> 1st Qu.:13.80  1st Qu.:73.00  1st Qu.:1.000  amc matador   :  5
#> Median :15.50  Median :76.00  Median :1.000  ford maverick :  5
#> Mean   :15.56  Mean   :75.99  Mean   :1.574  toyota corolla:  5
#> 3rd Qu.:17.10  3rd Qu.:79.00  3rd Qu.:2.000  amc gremlin   :  4
#> Max.   :24.80  Max.   :82.00  Max.   :3.000  amc hornet    :  4
#>                                              (Other)       :368
df_string <- read.csv(url, stringsAsFactors = FALSE, na.strings = "?")
summary(df_string)
#>      mpg         cylinders    displacement    horsepower       weight
#> Min.   : 9.00  Min.   :3.000  Min.   : 68.0  Min.   : 46.0  Min.   :1613
#> 1st Qu.:17.50  1st Qu.:4.000  1st Qu.:104.0  1st Qu.: 75.0  1st Qu.:2223
#> Median :23.00  Median :4.000  Median :146.0  Median : 93.5  Median :2800
#> Mean   :23.52  Mean   :5.458  Mean   :193.5  Mean   :104.5  Mean   :2970
#> 3rd Qu.:29.00  3rd Qu.:8.000  3rd Qu.:262.0  3rd Qu.:126.0  3rd Qu.:3609
#> Max.   :46.60  Max.   :8.000  Max.   :455.0  Max.   :230.0  Max.   :5140
#>                                              NA's   :5
#> acceleration       year          origin            name
#> Min.   : 8.00  Min.   :70.00  Min.   :1.000  Length:397
#> 1st Qu.:13.80  1st Qu.:73.00  1st Qu.:1.000  Class :character
#> Median :15.50  Median :76.00  Median :1.000  Mode  :character
#> Mean   :15.56  Mean   :75.99  Mean   :1.574
#> 3rd Qu.:17.10  3rd Qu.:79.00  3rd Qu.:2.000
#> Max.   :24.80  Max.   :82.00  Max.   :3.000
#>
Created on 2021-03-21 by the reprex package (v1.0.0)
Upvotes: 1