drstevok

Reputation: 715

Can I programmatically update the type of a set of columns (to factors) in data.table?

I would like to modify a set of columns inside a data.table to be factors. If I knew the names of the columns in advance, I think this would be straightforward.

library(data.table)
dt1 <- data.table(a = 1:4, b = rep(c('a','b'), 2), c = rep(c(0,1), 2))
dt1[,class(b)]
dt1[,b:=factor(b)]
dt1[,class(b)]

But I don't; instead I have a character vector of the variable names:

vars.factors  <- c('b','c')

I can apply the factor function to them without a problem ...

lapply(vars.factors, function(x) dt1[,class(get(x))])
lapply(vars.factors, function(x) dt1[,factor(get(x))])

But I don't know how to re-assign or update the original column in the data table.

This fails ...

  lapply(vars.factors, function(x) dt1[,x:=factor(get(x))])
  # Error in get(x) : invalid first argument 

As does this ...

  lapply(vars.factors, function(x) dt1[,get(x):=factor(get(x))])
  # Error in get(x) : object 'b' not found 

NB. I tried the answer proposed here without any luck.

Upvotes: 5

Views: 607

Answers (3)

JWilliman

Reputation: 3883

You can also use set() in a loop, which updates each column by reference:

for (col in vars.factors) 
  set(dt1, j=col, value=as.factor(dt1[[col]]))

vars.factors may be a vector of integers or character names specifying the columns to modify.

See https://stackoverflow.com/a/33000778/4241780 for more info.
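The point that vars.factors may hold integer positions rather than names can be illustrated with a minimal sketch. The data mirrors the question's dt1, and cols.idx is a hypothetical name introduced here only for illustration:

```r
library(data.table)

# Illustrative data matching the question; positions 2 and 3 are 'b' and 'c'.
dt1 <- data.table(a = 1:4, b = rep(c("a", "b"), 2), c = rep(c(0, 1), 2))
cols.idx <- c(2L, 3L)  # hypothetical: integer positions instead of names

for (col in cols.idx) {
  # set() modifies dt1 by reference; dt1[[col]] extracts the column by position
  set(dt1, j = col, value = as.factor(dt1[[col]]))
}

sapply(dt1, class)  # inspect the resulting column classes
```

Because set() works by reference, no reassignment of dt1 is needed after the loop.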

Upvotes: 1

rnso

Reputation: 24535

Using data frame:

> df1 = data.frame(dt1)
> df1[vars.factors] = lapply(df1[vars.factors], factor)
> dt1 = data.table(df1)

> dt1
   a b c
1: 1 1 b
2: 2 2 c
3: 3 3 b
4: 4 4 c

> str(dt1)
Classes ‘data.table’ and 'data.frame':  4 obs. of  3 variables:
 $ a: int  1 2 3 4
 $ b: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
 $ c: Factor w/ 2 levels "b","c": 1 2 1 2
 - attr(*, ".internal.selfref")=<externalptr> 

Upvotes: 2

Arun

Reputation: 118779

Yes, this is fairly straightforward:

dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols=vars.factors]

On the LHS of := (in j), we specify the names of the columns. If a column already exists, it is updated; otherwise, a new column is created. On the RHS, we loop over all the columns in .SD (which stands for Subset of Data), and the .SDcols argument specifies which columns .SD contains.
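As a rough sketch of the update-versus-create behaviour described here, the same call can target either existing or new column names. The data mirrors the question's dt1, and new.names (with its "_f" suffix) is a hypothetical choice made only for this example:

```r
library(data.table)

dt1 <- data.table(a = 1:4, b = rep(c("a", "b"), 2), c = rep(c(0, 1), 2))
vars.factors <- c("b", "c")

# Hypothetical new names: these columns do not exist yet, so they are created.
new.names <- paste0(vars.factors, "_f")
dt1[, (new.names) := lapply(.SD, as.factor), .SDcols = vars.factors]

# Existing names on the LHS: the columns are updated in place.
dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols = vars.factors]

sapply(dt1, class)  # inspect the resulting column classes
```

In both calls .SDcols restricts .SD to b and c, so column a is never touched.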

Following up on comment:

Note that the LHS must be wrapped in () so that it is evaluated and the column names stored in the vars.factors variable are used. This is because we allow the syntax

DT[, col := value]

when there's only one column to assign, with the column name given as a symbol (without quotes), purely for convenience. This creates a column named col and assigns value to it.

To distinguish these two cases, we need the (). Wrapping the LHS in () is sufficient to signal that we want the values stored in the variable.
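The two cases can be compared side by side in a small sketch. DT, x, and the variable col holding the string "y" are all made up for illustration:

```r
library(data.table)

DT <- data.table(x = 1:3)
col <- "y"  # hypothetical variable holding a column name

DT[, col := x * 2L]     # bare symbol: creates a column literally named "col"
DT[, (col) := x * 10L]  # parenthesised: evaluates col, creating a column named "y"

names(DT)  # "x" "col" "y"
```

The first assignment ignores the contents of the variable entirely, which is why the question's attempt with x := factor(get(x)) did not update the intended column.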

Upvotes: 12
