Dan

Reputation: 12074

Malformed factor when using cut and data.table

I have a data.table with a column called values. I'd like to create a factor from this column using cut. Some intervals should share the same label (i.e., NA); others should not. For example:

# Set RNG seed
set.seed(-1)

# Load library
library(data.table)

# Create data table
dt <- data.table(values = runif(1000))

# Divide vector into groups
dt[, group := cut(values,
                 breaks = c(-Inf, 0.2, 0.4, 0.6, 0.8, Inf),
                 labels = c(NA, "foo", NA, "bar", NA))]

dt
#> Error in as.character.factor(x): malformed factor

Created on 2019-09-25 by the reprex package (v0.3.0)

As you can see, this produces an error:

Error in as.character.factor(x): malformed factor

When I do the cut outside of data.table, it seems to work fine:

# Set RNG seed
set.seed(-1)

# Load library
library(data.table)

# Create data table
dt <- data.table(values = runif(1000))

# Outside of data table
cut(dt$values,
    breaks = c(-Inf, 0.2, 0.4, 0.6, 0.8, Inf),
    labels = c(NA, "foo", NA, "bar", NA))
#>    [1] <NA> <NA> <NA> <NA> foo  <NA> foo  <NA> foo  foo  bar  foo  foo 
#>   [14] <NA> foo  <NA> <NA> <NA> foo  bar  foo  <NA> foo  bar  foo  foo 
#>   [27] foo  <NA> <NA> <NA> <NA> bar  <NA> <NA> bar  bar  foo  foo  foo 
#>   [40] <NA> <NA> <NA> foo  <NA> <NA> <NA> <NA> foo  <NA> foo  bar  bar 
#>   [53] <NA> foo  <NA> <NA> foo  <NA> foo  <NA> foo  <NA> <NA> <NA> <NA>
#>   [66] foo  foo  <NA> bar  bar  <NA> <NA> <NA> foo  bar  bar  <NA> <NA>
#>   [79] <NA> <NA> foo  bar  bar  bar  bar  bar  <NA> bar  <NA> <NA> <NA>
#>   [92] <NA> <NA> <NA> <NA> <NA> foo  <NA> foo  foo  foo  <NA> <NA> <NA>
#>  [105] foo  <NA> foo  <NA> bar  <NA> <NA> <NA> foo  bar  <NA> bar  foo 
#>  [118] foo  <NA> <NA> <NA> <NA> <NA> <NA> bar  <NA> <NA> <NA> <NA> <NA>
#>  [131] <NA> foo  <NA> <NA> <NA> bar  <NA> <NA> foo  foo  <NA> <NA> foo 
#>  [144] <NA> <NA> <NA> <NA> <NA> <NA> <NA> foo  bar  bar  <NA> <NA> <NA>
#>  [157] <NA> foo  <NA> <NA> foo  bar  bar  foo  <NA> <NA> <NA> foo  <NA>
#>  [170] <NA> <NA> bar  <NA> <NA> <NA> <NA> <NA> foo  foo  <NA> <NA> foo 
#>  [183] <NA> <NA> <NA> foo  bar  <NA> foo  <NA> bar  foo  <NA> <NA> bar 
#>  [196] foo  <NA> <NA> foo  bar  <NA> <NA> bar  <NA> <NA> bar  bar  <NA>
#>  [209] <NA> bar  bar  bar  <NA> <NA> foo  bar  <NA> bar  <NA> bar  foo 
#>  [222] bar  <NA> <NA> foo  bar  bar  bar  foo  <NA> bar  <NA> <NA> <NA>
#>  [235] <NA> bar  <NA> foo  foo  foo  <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#>  [248] <NA> <NA> <NA> <NA> <NA> foo  <NA> bar  <NA> bar  <NA> bar  bar 
#>  [261] <NA> <NA> <NA> <NA> foo  <NA> <NA> <NA> <NA> foo  <NA> bar  <NA>
#>  [274] <NA> <NA> <NA> <NA> bar  <NA> <NA> bar  bar  bar  foo  <NA> foo 
#>  [287] foo  <NA> <NA> <NA> <NA> bar  <NA> <NA> <NA> foo  foo  <NA> <NA>
#>  [300] foo  <NA> <NA> <NA> bar  <NA> <NA> <NA> <NA> <NA> <NA> <NA> bar 
#>  [313] foo  bar  <NA> <NA> <NA> <NA> foo  <NA> <NA> <NA> <NA> <NA> <NA>
#>  [326] <NA> <NA> foo  bar  <NA> foo  bar  <NA> bar  bar  <NA> <NA> bar 
#>  [339] <NA> <NA> <NA> <NA> <NA> <NA> bar  foo  <NA> <NA> <NA> bar  <NA>
#>  [352] bar  foo  <NA> foo  <NA> <NA> foo  <NA> <NA> <NA> bar  <NA> foo 
#>  [365] foo  <NA> <NA> <NA> bar  <NA> <NA> <NA> bar  foo  foo  foo  <NA>
#>  [378] <NA> <NA> <NA> <NA> foo  <NA> <NA> <NA> foo  <NA> bar  bar  <NA>
#>  [391] bar  bar  <NA> foo  <NA> bar  <NA> bar  <NA> foo  <NA> foo  foo 
#>  [404] <NA> <NA> <NA> <NA> <NA> foo  foo  bar  <NA> bar  foo  <NA> foo 
#>  [417] <NA> bar  <NA> <NA> foo  <NA> <NA> <NA> <NA> <NA> bar  foo  bar 
#>  [430] <NA> <NA> bar  foo  <NA> <NA> <NA> <NA> <NA> <NA> foo  <NA> <NA>
#>  [443] <NA> foo  <NA> bar  <NA> foo  foo  bar  <NA> <NA> <NA> bar  <NA>
#>  [456] foo  <NA> <NA> <NA> <NA> foo  <NA> <NA> bar  foo  foo  <NA> <NA>
#>  [469] <NA> <NA> bar  <NA> foo  foo  <NA> <NA> <NA> <NA> foo  <NA> <NA>
#>  [482] bar  foo  bar  <NA> <NA> foo  <NA> foo  foo  <NA> <NA> <NA> <NA>
#>  [495] foo  <NA> <NA> <NA> <NA> foo  foo  bar  foo  <NA> <NA> <NA> <NA>
#>  [508] <NA> <NA> <NA> <NA> <NA> <NA> foo  <NA> foo  bar  bar  <NA> foo 
#>  [521] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> foo  <NA> bar  bar  foo 
#>  [534] <NA> foo  foo  bar  <NA> <NA> <NA> bar  <NA> <NA> foo  bar  bar 
#>  [547] <NA> <NA> <NA> bar  foo  bar  <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#>  [560] <NA> <NA> bar  <NA> <NA> <NA> foo  <NA> <NA> <NA> <NA> <NA> <NA>
#>  [573] foo  <NA> foo  <NA> bar  foo  foo  bar  <NA> <NA> <NA> <NA> bar 
#>  [586] foo  <NA> foo  <NA> bar  <NA> <NA> foo  <NA> <NA> <NA> <NA> <NA>
#>  [599] foo  <NA> <NA> foo  <NA> bar  foo  <NA> <NA> <NA> bar  <NA> bar 
#>  [612] foo  foo  bar  <NA> <NA> bar  bar  foo  bar  <NA> <NA> <NA> bar 
#>  [625] <NA> foo  <NA> bar  <NA> <NA> <NA> <NA> foo  bar  bar  <NA> foo 
#>  [638] <NA> bar  <NA> <NA> <NA> foo  <NA> foo  bar  <NA> bar  <NA> <NA>
#>  [651] <NA> <NA> bar  foo  <NA> <NA> bar  <NA> foo  foo  foo  <NA> foo 
#>  [664] <NA> foo  <NA> <NA> <NA> <NA> <NA> <NA> <NA> bar  <NA> <NA> <NA>
#>  [677] foo  <NA> <NA> bar  bar  <NA> foo  <NA> <NA> <NA> <NA> <NA> bar 
#>  [690] <NA> <NA> foo  bar  foo  <NA> <NA> <NA> bar  foo  bar  <NA> bar 
#>  [703] <NA> <NA> foo  <NA> <NA> bar  <NA> <NA> foo  <NA> <NA> <NA> bar 
#>  [716] foo  bar  <NA> foo  bar  <NA> <NA> <NA> bar  <NA> <NA> <NA> bar 
#>  [729] <NA> foo  foo  <NA> <NA> bar  <NA> bar  foo  <NA> <NA> <NA> <NA>
#>  [742] bar  <NA> <NA> foo  foo  <NA> <NA> <NA> <NA> <NA> <NA> bar  foo 
#>  [755] <NA> foo  <NA> <NA> <NA> <NA> bar  foo  <NA> <NA> <NA> foo  bar 
#>  [768] bar  <NA> <NA> <NA> <NA> <NA> bar  foo  foo  bar  <NA> <NA> bar 
#>  [781] foo  foo  <NA> <NA> foo  foo  bar  <NA> foo  bar  <NA> foo  <NA>
#>  [794] foo  <NA> bar  <NA> foo  foo  <NA> <NA> bar  foo  <NA> foo  <NA>
#>  [807] <NA> <NA> <NA> <NA> bar  foo  <NA> foo  foo  bar  <NA> bar  <NA>
#>  [820] <NA> bar  bar  <NA> bar  <NA> <NA> foo  bar  <NA> <NA> <NA> bar 
#>  [833] <NA> foo  foo  <NA> foo  <NA> <NA> <NA> <NA> bar  foo  bar  bar 
#>  [846] bar  <NA> <NA> <NA> foo  bar  foo  <NA> <NA> bar  <NA> foo  <NA>
#>  [859] <NA> foo  <NA> <NA> bar  bar  bar  <NA> foo  <NA> <NA> <NA> <NA>
#>  [872] foo  <NA> <NA> foo  <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#>  [885] <NA> <NA> bar  bar  <NA> <NA> <NA> <NA> foo  <NA> bar  <NA> <NA>
#>  [898] <NA> bar  <NA> <NA> <NA> <NA> <NA> foo  foo  <NA> <NA> <NA> foo 
#>  [911] <NA> bar  bar  bar  bar  bar  <NA> <NA> bar  foo  bar  <NA> <NA>
#>  [924] <NA> <NA> <NA> foo  bar  bar  bar  foo  <NA> <NA> foo  foo  <NA>
#>  [937] bar  <NA> <NA> bar  <NA> bar  <NA> <NA> <NA> bar  <NA> <NA> <NA>
#>  [950] bar  <NA> foo  <NA> <NA> foo  bar  <NA> <NA> <NA> <NA> <NA> <NA>
#>  [963] foo  <NA> foo  <NA> <NA> <NA> <NA> <NA> foo  <NA> bar  foo  <NA>
#>  [976] bar  bar  <NA> bar  <NA> foo  <NA> <NA> foo  <NA> <NA> bar  foo 
#>  [989] <NA> <NA> <NA> bar  foo  bar  foo  bar  <NA> <NA> bar  <NA>
#> Levels: <NA> foo bar

Created on 2019-09-25 by the reprex package (v0.3.0)

Why is data.table getting upset?


sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.12.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.15 
#>  [9] highr_0.8       knitr_1.25      stringr_1.4.0   xfun_0.9       
#> [13] digest_0.6.21   evaluate_0.14

Upvotes: 1

Views: 518

Answers (1)

Dan

Reputation: 12074

Here's a solution based on @sindri_baldur's insights above.

# Set RNG seed
set.seed(-1)

# Load library
library(data.table)

# Create data table
dt <- data.table(values = runif(1000))

# Divide vector into groups
dt[, group := factor(cut(values,
                     breaks = c(-Inf, 0.2, 0.4, 0.6, 0.8, Inf),
                     labels = c(NA, "foo", NA, "bar", NA)))]

dt
#>          values group
#>    1: 0.4866672  <NA>
#>    2: 0.1913653  <NA>
#>    3: 0.9932719  <NA>
#>    4: 0.1467027  <NA>
#>    5: 0.2415895   foo
#>   ---                
#>  996: 0.6428781   bar
#>  997: 0.4525126  <NA>
#>  998: 0.9631253  <NA>
#>  999: 0.7285391   bar
#> 1000: 0.1713554  <NA>

Created on 2019-09-26 by the reprex package (v0.3.0)

By default, factor() excludes NA when creating levels (its exclude argument defaults to NA), so re-wrapping the result of cut drops the NA level, which seems to make data.table happy.
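To see what the extra factor() call changes, here is a minimal base-R sketch (data.table is not needed for this part). The vector x is a made-up stand-in for the values column, with one value per interval; the names g and g2 are only for illustration.

```r
# Hypothetical input: one value in each of the five intervals
x <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Same call as in the question: three of the five labels are NA
g <- cut(x,
         breaks = c(-Inf, 0.2, 0.4, 0.6, 0.8, Inf),
         labels = c(NA, "foo", NA, "bar", NA))

# The question's output shows NA listed among the levels ("Levels: <NA> foo bar")
print(levels(g))

# factor() uses exclude = NA by default, so the NA level is dropped
# and the affected elements become ordinary missing values
g2 <- factor(g)
print(levels(g2))
print(as.character(g2))
```

After the re-wrap, only "foo" and "bar" remain as levels, and the values from the NA-labelled intervals are plain NA entries rather than members of an NA level.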


Edit

This issue was resolved by bug fix #45 in data.table v1.12.4.
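If you are unsure whether your installation already includes that release, a quick check along these lines may help. This is only a sketch: has_cut_fix is a hypothetical helper name, and it assumes that any version at or above 1.12.4 contains the fix.

```r
# Hypothetical helper: TRUE when data.table is installed at version 1.12.4 or later
has_cut_fix <- function() {
  requireNamespace("data.table", quietly = TRUE) &&
    utils::packageVersion("data.table") >= "1.12.4"
}

has_cut_fix()
```

If this returns FALSE, either update data.table or keep the factor() wrapper shown above.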

Upvotes: 1
