Reputation: 41
Well, I've read the function reference for step_num2factor and didn't figured it out how to use it properly, honestly.
temp_names <- as.character(unique(sort(all_raw$MSSubClass)))
price_recipe <-
recipe(SalePrice ~ . , data = train_raw) %>%
step_num2factor(MSSubClass, levels = temp_names)
temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data
class(all_raw$MSSubClass)
# > col_double()
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
The data output temp_data$MSSubClass
is full of NA after the use of the step.
The obs are saved as 20,30,40.... 190 and I want to transform to names ( or even the same numbers but as unordered factors)
If you know more blog posts about the usage of step_num2factor or some code that uses, I would be gladly to see as well.
The complete dataset is provided by kaggle at: kaggle data
Thx in advance,
Upvotes: 0
Views: 650
Reputation: 11613
I don't think that step_num2factor()
is the best fit for this variable. Take a look at the help again, and notice that you need to give a transform
argument that can be used to modify the numeric values prior to determining the levels. This would work OK if this data was all multiples of 10, but you have some values like 75 and 85, so I don't think you want that. This recipe step works best for numeric/integer-ish variables that you can more easily transform to a set of integers with a simple function.
Instead, I think you should think about step_mutate()
and a simple coercion to a factor type:
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_character(),
#> Id = col_double(),
#> MSSubClass = col_double(),
#> LotFrontage = col_double(),
#> LotArea = col_double(),
#> OverallQual = col_double(),
#> OverallCond = col_double(),
#> YearBuilt = col_double(),
#> YearRemodAdd = col_double(),
#> MasVnrArea = col_double(),
#> BsmtFinSF1 = col_double(),
#> BsmtFinSF2 = col_double(),
#> BsmtUnfSF = col_double(),
#> TotalBsmtSF = col_double(),
#> `1stFlrSF` = col_double(),
#> `2ndFlrSF` = col_double(),
#> LowQualFinSF = col_double(),
#> GrLivArea = col_double(),
#> BsmtFullBath = col_double(),
#> BsmtHalfBath = col_double(),
#> FullBath = col_double()
#> # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.
price_recipe <-
recipe(SalePrice ~ ., data = train_raw) %>%
step_mutate(MSSubClass = factor(MSSubClass))
juiced_price <- prep(price_recipe) %>%
juice()
levels(juiced_price$MSSubClass)
#> [1] "20" "30" "40" "45" "50" "60" "70" "75" "80" "85" "90" "120"
#> [13] "160" "180" "190"
juiced_price %>%
count(MSSubClass)
#> # A tibble: 15 x 2
#> MSSubClass n
#> <fct> <int>
#> 1 20 536
#> 2 30 69
#> 3 40 4
#> 4 45 12
#> 5 50 144
#> 6 60 299
#> 7 70 60
#> 8 75 16
#> 9 80 58
#> 10 85 20
#> 11 90 52
#> 12 120 87
#> 13 160 63
#> 14 180 10
#> 15 190 30
Created on 2020-05-03 by the reprex package (v0.3.0)
This looks to me like it gets you the factor levels you want. If you want to save those strings from the .txt
file like "1-STORY 1945 & OLDER" as a new_levels
vector, you could say factor(MSSubClass, levels = new_levels)
.
Upvotes: 1