vijay

Reputation: 113

data.table() still converts strings to factors?

From what I can see here I would assume that data.table v1.8.0+ does not automatically convert strings to factors.

Specifically, to quote Matthew Dowle from that page:

No need for stringsAsFactors. Done like this in v1.8.0:

o character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported.

I'm not seeing that ... here's my R session transcript:

First, I make sure I have a recent enough version of data.table (1.8.0 or later):

> library(data.table)
data.table 1.8.8  For help type: help("data.table")

Next, I create a 2x2 data.table. Notice that it creates factors ...

> m <- matrix(letters[1:4], ncol=2)
> str(data.table(m))
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ V1: Factor w/ 2 levels "a","b": 1 2
 $ V2: Factor w/ 2 levels "c","d": 1 2
 - attr(*, ".internal.selfref")=<externalptr> 

When I use stringsAsFactors in data.frame() and then call data.table(), all is well ...

> str(data.table(data.frame(m, stringsAsFactors=FALSE)))
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ X1: chr  "a" "b"
 $ X2: chr  "c" "d"
 - attr(*, ".internal.selfref")=<externalptr> 

What am I missing? Is data.frame() supposed to convert strings to factors, and if so, is there a "better way" of turning that behavior off?

Thanks!

Upvotes: 11

Views: 7556

Answers (2)

mnel

Reputation: 115392

Update:

This issue seems to have slipped through unnoticed until now. Thanks to @fpinter for filing the issue recently. It is now fixed in commit 1322. From NEWS, item 39 under bug fixes for v1.9.3:

as.data.table.matrix does not convert strings to factors by default. data.table likes and prefers using character vectors to factors. Closes #745. Thanks to @fpinter for reporting the issue on the github issue tracker and to vijay for reporting here on SO.


It appears that this non-coercion is not yet implemented.

data.table() handles matrix arguments by calling as.data.table():

if (is.matrix(xi) || is.data.frame(xi)) {
    xi = as.data.table(xi, keep.rownames = keep.rownames)
    x[[i]] = xi
    numcols[i] = length(xi)
}

and as.data.table.matrix contains

if (mode(x) == "character") {
    for (i in ic) value[[i]] <- as.factor(x[, i])
}

Might be worth reporting this to the bug tracker. (The coercion to factor is still present in 1.8.9, the current R-Forge version.)
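On versions where as.data.table.matrix still coerces, one possible workaround is to bypass the matrix method by handing data.table a list of column vectors instead. The sketch below is only an illustration: matrix_to_dt is a hypothetical helper name, not part of data.table, and it assumes that as.data.table() on a list leaves character vectors alone, as the v1.8.0 NEWS entry quoted in the question implies.

```r
library(data.table)

# Hypothetical helper (not part of data.table): build a data.table from a
# matrix without going through as.data.table.matrix, so that character
# columns are not turned into factors.
matrix_to_dt <- function(m) {
  # Split the matrix into a list with one plain vector per column.
  cols <- lapply(seq_len(ncol(m)), function(j) m[, j])
  # Reuse the matrix column names if present, otherwise fall back to V1, V2, ...
  nms <- colnames(m)
  if (is.null(nms)) nms <- paste0("V", seq_len(ncol(m)))
  names(cols) <- nms
  # The list method is used here, not the matrix method.
  as.data.table(cols)
}

m <- matrix(letters[1:4], ncol = 2)
dt <- matrix_to_dt(m)
str(dt)
```

Going through a list keeps the conversion inside data.table and avoids the detour via data.frame(), at the cost of one extra copy of each column.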

Upvotes: 10

dickoa

Reputation: 18437

As a workaround, and to complement @mnel's answer: if you want to turn off the default behavior of data.frame(), you can use the dedicated option.

options(stringsAsFactors=FALSE)

str(data.table(data.frame(m)))
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ X1: chr  "a" "b"
 $ X2: chr  "c" "d"
 - attr(*, ".internal.selfref")=<externalptr> 
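Since options() changes the setting for the whole session, it may be preferable to limit it to a single call. Here is a minimal sketch of that idea in base R. df_no_factors is a hypothetical wrapper name, introduced only for illustration.

```r
# Hypothetical wrapper: run data.frame() with stringsAsFactors = FALSE for
# one call only, then restore whatever the option was before.
df_no_factors <- function(...) {
  old <- options(stringsAsFactors = FALSE)  # returns the previous value
  on.exit(options(old), add = TRUE)         # restore it when the function exits
  data.frame(...)
}

m <- matrix(letters[1:4], ncol = 2)
before <- getOption("stringsAsFactors")
df <- df_no_factors(m)
str(df)

# The session-wide option is the same as it was before the call.
identical(getOption("stringsAsFactors"), before)
```

The result can then be passed to data.table() as in the question, without affecting other code that relies on the session default.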

Upvotes: 6
