nhern121

Reputation: 3921

How to do this using data.table?

I have a data table (DatosMex) in R and would like to recode a column within it named industry. The distinct categories for this variable are:

  Agricultura,Ganaderia,Pesca,Caza Forestal                      
  Asociaciones                                                       
  Comercio                                                       
  Construccion                                                   
  Energia,Petroleo,Gas,Mineria                                   
  Gobierno                                                       
  Industria                                                      
  N/A                                                            
  NULL                                                           
  Servicios                      

I want to create a new variable, say gr_industry, that groups some categories. For instance, my new variable must group the categories "Agricultura,Ganaderia,Pesca,Caza Forestal", "Asociaciones", "Energia,Petroleo,Gas,Mineria" and "Gobierno" and assign them the code 1.

How would you do this using the data.table package syntax?

My approach was this:

 # Create a numeric id for each industry (this relies on industry being a factor)
 DatosMex[, cod_industria := as.numeric(industry)]
 # Create a lookup table mapping each id to its group
 ind <- data.table(cod_industria = 1:10, gr_industry = c(1,1,2,3,1,1,4,6,6,5))
 setkey(DatosMex, cod_industria)
 setkey(ind, cod_industria)
 DatosMex[ind]

So, as you can see, I had to create a new data table, ind, and then do the inner join. My question is: is there another way of doing this the data.table way? I don't want to create a table each time I need to do something similar. Also, I'd like to avoid using if statements.

Upvotes: 1

Views: 231

Answers (2)

IRTFM

Reputation: 263332

I'm guessing one does not need to set a key or create a new data.table. The [ function is generally very fast, especially on data.table objects:

 DatosMex[, gr_industry := c(1,1,2,3,1,1,4,6,6,5)[cod_industria] ]

If that translation vector is long, you can define it outside the data.table and refer to it by name:

 dta <- data.table(a = sample(1:10, 20, replace = TRUE))
 g6 <- c(1,1,2,3,1,1,4,6,6,5)
 dta[, ind := g6[a]]
 #-------------------
     a ind
 1:  8   6
 2:  4   3
 3: 10   5
 4:  8   6
 snipped output
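
A variation on the same idea, offered as a sketch rather than anything from the original question: if industry is a factor (or character) column, a named vector lets you index by category name instead of by integer code, so the mapping does not depend on the order of the levels. The toy data and the name recode_map below are made up for illustration.

```r
library(data.table)

# Hypothetical toy data; industry is assumed to be a factor, as in the question
DatosMex <- data.table(
  industry = factor(c("Comercio", "Gobierno",
                      "Energia,Petroleo,Gas,Mineria", "Servicios"))
)

# Hypothetical recode map keyed by category name (only the levels used here)
recode_map <- c("Comercio" = 2,
                "Gobierno" = 1,
                "Energia,Petroleo,Gas,Mineria" = 1,
                "Servicios" = 5)

# Index by character so the lookup matches on names, not on integer codes;
# unname() drops the names carried over from recode_map
DatosMex[, gr_industry := unname(recode_map[as.character(industry)])]
```

Any category missing from recode_map comes back as NA, which makes gaps in the mapping easy to spot.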

Upvotes: 4

mnel

Reputation: 115392

From a code-organization point of view, you need to define the recoding somewhere, either

  • in a data.table or
  • a switch function.

Here is a switch function example:

  ## a function that will `switch` based on the levels 1:10
  ## note that it is Vectorized (to avoid calling `sapply`)
  switch_industry <- Vectorize(function(i) switch(i, 1,1,2,3,1,1,4,6,6,5))


  DatosMex[, gr_industry := switch_industry(cod_industria)]

I would not call this a data.table-specific solution.
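
For the first option (keeping the recoding in a data.table), here is a minimal sketch of an update join. It assumes a reasonably recent data.table that supports the on= argument, and the toy DatosMex below is made up for illustration. The lookup table ind is the one from the question.

```r
library(data.table)

# Hypothetical toy data with the integer industry codes already present
DatosMex <- data.table(cod_industria = c(3L, 1L, 10L, 6L))

# Lookup table from the question: one row per industry code
ind <- data.table(cod_industria = 1:10,
                  gr_industry   = c(1, 1, 2, 3, 1, 1, 4, 6, 6, 5))

# Update join: add gr_industry to DatosMex by reference, matching on
# cod_industria; the i. prefix refers to the column coming from ind
DatosMex[ind, on = "cod_industria", gr_industry := i.gr_industry]
```

This adds the column in place without setting keys or reordering DatosMex, and the lookup table only has to be defined once.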

Upvotes: 2
