Walber Moreira

Reputation: 41

step_num2factor() usage in tidymodels (recipes package)

I've read the function reference for step_num2factor and, honestly, I still haven't figured out how to use it properly.

temp_names <- as.character(unique(sort(all_raw$MSSubClass)))

price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_num2factor(MSSubClass, levels = temp_names)


temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data

class(all_raw$MSSubClass)
# numeric (readr parses the column as col_double())

MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

After this step, temp_data$MSSubClass is entirely NA. The observations are stored as the numbers 20, 30, 40, ..., 190, and I want to convert them to the names above (or even keep the same numbers, but as an unordered factor).

If you know of any blog posts on using step_num2factor, or some code that uses it, I would be glad to see them as well.

The complete dataset is provided by Kaggle at: kaggle data

Thx in advance,

Upvotes: 0

Views: 650

Answers (1)

Julia Silge

Reputation: 11613

I don't think that step_num2factor() is the best fit for this variable. Take a look at the help again, and notice the transform argument, a function that can be used to modify the numeric values prior to determining the levels. As far as I can tell from the help page, the step uses the (transformed) values as integer positions in levels, so codes such as 20 or 190 point past the end of your levels vector and come back as NA. This would work OK if this data were all multiples of 10, but you have some values like 75 and 85, so I don't think you want that. This recipe step works best for numeric/integer-ish variables that you can easily transform to a set of integers with a simple function.
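
The NA values from the question can be reproduced in base R without recipes. This is only a minimal sketch, assuming that step_num2factor() treats the (transformed) numeric values as integer positions in levels, as its help page describes. The names codes, lvls, and pos are made up for illustration, and the data is a toy stand-in for MSSubClass.

```r
# Toy stand-in for the MSSubClass codes (hypothetical, not the Kaggle data)
codes <- c(20, 30, 45, 190)
lvls  <- as.character(sort(unique(codes)))

# Using the raw codes as positions: a 4-element vector has no 20th element
naive <- lvls[as.integer(codes)]
naive
#> [1] NA NA NA NA

# Mapping each code to its position first gives valid indices
pos <- match(codes, sort(unique(codes)))
f <- factor(lvls[pos], levels = lvls)
anyNA(f)
#> [1] FALSE
```

A function like function(x) match(x, codes) could in principle be passed as transform, but that is an untested assumption here, and a plain factor() coercion is simpler.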

Instead, I think you should consider step_mutate() and a simple coercion to a factor type:

library(tidyverse)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step

train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   Id = col_double(),
#>   MSSubClass = col_double(),
#>   LotFrontage = col_double(),
#>   LotArea = col_double(),
#>   OverallQual = col_double(),
#>   OverallCond = col_double(),
#>   YearBuilt = col_double(),
#>   YearRemodAdd = col_double(),
#>   MasVnrArea = col_double(),
#>   BsmtFinSF1 = col_double(),
#>   BsmtFinSF2 = col_double(),
#>   BsmtUnfSF = col_double(),
#>   TotalBsmtSF = col_double(),
#>   `1stFlrSF` = col_double(),
#>   `2ndFlrSF` = col_double(),
#>   LowQualFinSF = col_double(),
#>   GrLivArea = col_double(),
#>   BsmtFullBath = col_double(),
#>   BsmtHalfBath = col_double(),
#>   FullBath = col_double()
#>   # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.

price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_mutate(MSSubClass = factor(MSSubClass))

juiced_price <- prep(price_recipe) %>%
  juice()

levels(juiced_price$MSSubClass)
#>  [1] "20"  "30"  "40"  "45"  "50"  "60"  "70"  "75"  "80"  "85"  "90"  "120"
#> [13] "160" "180" "190"

juiced_price %>%
  count(MSSubClass)
#> # A tibble: 15 x 2
#>    MSSubClass     n
#>    <fct>      <int>
#>  1 20           536
#>  2 30            69
#>  3 40             4
#>  4 45            12
#>  5 50           144
#>  6 60           299
#>  7 70            60
#>  8 75            16
#>  9 80            58
#> 10 85            20
#> 11 90            52
#> 12 120           87
#> 13 160           63
#> 14 180           10
#> 15 190           30

Created on 2020-05-03 by the reprex package (v0.3.0)

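This looks to me like it gets you the factor levels you want. If you want to use the strings from the .txt file, like "1-STORY 1945 & OLDER", save them as a new_levels vector in the same order as the numeric codes, and then you could say factor(MSSubClass, levels = codes, labels = new_levels), where codes holds the numeric values. Passing the strings as levels alone would not match the numeric values and would give NA.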
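
A rough base R sketch of attaching the descriptive names is shown here. In factor(), levels has to hold the values that actually occur in the data (the numeric codes), while labels holds the text to display. The vectors codes, new_levels, and x are hypothetical and cover only three of the codes from the data dictionary.

```r
# Hypothetical subset of codes and descriptions from the data dictionary
codes      <- c(20, 30, 40)
new_levels <- c("1-STORY 1946 & NEWER ALL STYLES",
                "1-STORY 1945 & OLDER",
                "1-STORY W/FINISHED ATTIC ALL AGES")

x <- c(30, 20, 20, 40)  # toy stand-in for MSSubClass

# levels = values present in the data, labels = names shown instead
f <- factor(x, levels = codes, labels = new_levels)

as.integer(f)
#> [1] 2 1 1 3
as.character(f)[1]
#> [1] "1-STORY 1945 & OLDER"
```

Inside a recipe, the same call should fit in step_mutate(MSSubClass = factor(MSSubClass, levels = codes, labels = new_levels)), assuming codes and new_levels cover every value in the training data.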

Upvotes: 1
