Reputation: 31

How do I add a column to a data.table and return multiple columns without modifying underlying data?

I have the following data.table in R

dt <- data.table(gender = c("Male", "Female"), Prop = c(0.49, 0.51))
#   gender Prop
# 1:   Male  0.49
# 2: Female  0.51

I want to calculate a Freq = Prop * 1000 column and then return just the gender and Freq columns. How can I do this in a single line of code and without explicitly referring to the gender column and without modifying dt?

The best I can manage is:

onsdist$gender[, c(.SD, Freq = Prop * 1000)][, .SD, .SDcols = - "Prop"]
#    gender Freq1 Freq2
# 1:   Male   490   490
# 2: Female   510   510

but I've ended up with a duplicated Freq column.

(The reason I don't want to refer to gender is because it changes across data.tables. The reason I don't want to modify dt is because I need to re-use the original version later).

Upvotes: 3

Answers (3)

G. Grothendieck

Reputation: 270268

1) Use transform with Prop = NULL

dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)]
##    gender Freq
## 1:   Male  490
## 2: Female  510

2) or this variation

transform(dt, Freq = Prop * 1000, Prop = NULL)
##    gender Freq
## 1:   Male  490
## 2: Female  510

3) We can speed it up substantially by just replacing transform with ftransform found in the collapse package.

library(collapse)
dt[, ftransform(.SD, Freq = Prop * 1000, Prop = NULL)]

4) Similarly

library(collapse)
ftransform(dt, Freq = Prop * 1000, Prop = NULL)

Benchmark

Using the data in the question we see that (4) above, labelled ex4 below, which uses ftransform from collapse without [.data.table is much faster than the others above.

library(collapse)
library(data.table)
library(microbenchmark)

microbenchmark(
    ex1 = dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)],
    ex2 = transform(dt, Freq = Prop * 1000, Prop = NULL),
    ex3 = dt[, ftransform(.SD, Freq = Prop * 1000, Prop = NULL)],
    ex4 = ftransform(dt, Freq = Prop * 1000, Prop = NULL)
)

Unit: microseconds
 expr      min       lq       mean    median       uq      max neval  cld
  ex1 1847.601 1927.402 2046.04098 2015.4015 2093.251 2706.200   100    d
  ex2  959.700 1000.701 1074.93098 1046.1510 1122.601 1606.201   100  b  
  ex3 1048.201 1090.351 1139.57598 1121.6005 1174.201 1381.602   100   c 
  ex4   68.401   85.551   93.08802   89.2515  100.551  168.400   100 a

Upvotes: 5

Merijn van Tilborg

Reputation: 5897

Another solution

dt[, .(dt[, 1], Freq = Prop * 1000)]

   gender Freq
1:   Male  490
2: Female  510

Some benchmark of the options given in all answers

Note that I increased the sample data quite a bit, but I was just curious about the differences between the methods for other data sets as well.

Transform is very slow here and not recommended, the other methods are quite similar and the power of .SD and .SDcols are the fastest, although in this case, keeping all your rows and do not update anything by reference using the first method is hardly slower.

set.seed(42)

dt <- data.table(
  gender = rep(LETTERS[1:25], 40000),
  Prop = runif(n = 1000000))

library(rbenchmark)

benchmark(
  "dt[, .(dt[, 1], Freq = Prop * 1000)]" = {
    dt[, .(dt[, 1], Freq = Prop * 1000)]
  },
  "dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1]" = {
    dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1]
  },
  "dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -\"Prop\"]" = {
    dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -"Prop"]
  },
  "dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)]" = {
    dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)]
  },
  "transform(dt, Freq = Prop * 1000, Prop = NULL)" = {
    transform(dt, Freq = Prop * 1000, Prop = NULL)
  },
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative")
)

#                                                     test replications elapsed relative
# 1                   dt[, .(dt[, 1], Freq = Prop * 1000)]         1000   18.66    1.112
# 3 dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -"Prop"]         1000   17.02    1.014
# 2       dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1]         1000   16.78    1.000
# 4  dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)]         1000  333.51   19.875
# 5         transform(dt, Freq = Prop * 1000, Prop = NULL)         1000  329.41   19.631

Sidenote

Keep in mind that creating the column by reference is like a 5-fold faster dt[, Freq := Prop * 1000] and OP uses the argument the table is re-used later. I would suggest always to do all calculations and preparations by reference on the table when it gains in speed. You can always subset your output from there.

#                                               test replications elapsed relative
# 1             dt[, .(dt[, 1], Freq = Prop * 1000)]         1000   16.25    5.783
# 2 dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1]         1000   13.33    4.744
# 3                         t[, Freq := Prop * 1000]         1000    2.81    1.000

Upvotes: 1

akrun

Reputation: 887911

We can use the data.table syntax to get the output format

dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -"Prop"]

-output

   gender Freq
1:   Male  490
2: Female  510

Upvotes: 3

How do I add a column to a data.table and return multiple columns without modifying underlying data?

Answers (3)

Benchmark

Related Questions