Reputation: 31
I have the following data.table
in R
dt <- data.table(gender = c("Male", "Female"), Prop = c(0.49, 0.51))
# gender Prop
# 1: Male 0.49
# 2: Female 0.51
I want to calculate a Freq = Prop * 1000
column and then return just the gender
and Freq
columns. How can I do this in a single line of code and without explicitly referring to the gender
column and without modifying dt
?
The best I can manage is:
onsdist$gender[, c(.SD, Freq = Prop * 1000)][, .SD, .SDcols = - "Prop"]
# gender Freq1 Freq2
# 1: Male 490 490
# 2: Female 510 510
but I've ended up with a duplicated Freq
column.
(The reason I don't want to refer to gender
is because it changes across data.table
s. The reason I don't want to modify dt
is because I need to re-use the original version later).
Upvotes: 3
Views: 104
Reputation: 270268
1) Use transform with Prop = NULL
dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)]
## gender Freq
## 1: Male 490
## 2: Female 510
2) or this variation
transform(dt, Freq = Prop * 1000, Prop = NULL)
## gender Freq
## 1: Male 490
## 2: Female 510
3) We can speed it up substantially by just replacing transform with ftransform found in the collapse package.
library(collapse)
dt[, ftransform(.SD, Freq = Prop * 1000, Prop = NULL)]
4) Similarly
library(collapse)
ftransform(dt, Freq = Prop * 1000, Prop = NULL)
Using the data in the question we see that (4) above, labelled ex4 below, which uses ftransform from collapse without [.data.table is much faster than the others above.
library(collapse)
library(data.table)
library(microbenchmark)
microbenchmark(
ex1 = dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)],
ex2 = transform(dt, Freq = Prop * 1000, Prop = NULL),
ex3 = dt[, ftransform(.SD, Freq = Prop * 1000, Prop = NULL)],
ex4 = ftransform(dt, Freq = Prop * 1000, Prop = NULL)
)
Unit: microseconds
expr min lq mean median uq max neval cld
ex1 1847.601 1927.402 2046.04098 2015.4015 2093.251 2706.200 100 d
ex2 959.700 1000.701 1074.93098 1046.1510 1122.601 1606.201 100 b
ex3 1048.201 1090.351 1139.57598 1121.6005 1174.201 1381.602 100 c
ex4 68.401 85.551 93.08802 89.2515 100.551 168.400 100 a
Upvotes: 5
Reputation: 5897
Another solution
dt[, .(dt[, 1], Freq = Prop * 1000)]
gender Freq
1: Male 490
2: Female 510
Some benchmark of the options given in all answers
Note that I increased the sample data quite a bit, but I was just curious about the differences between the methods for other data sets as well.
Transform is very slow here and not recommended, the other methods are quite similar and the power of .SD and .SDcols are the fastest, although in this case, keeping all your rows and do not update anything by reference using the first method is hardly slower.
set.seed(42)
dt <- data.table(
gender = rep(LETTERS[1:25], 40000),
Prop = runif(n = 1000000))
library(rbenchmark)
benchmark(
"dt[, .(dt[, 1], Freq = Prop * 1000)]" = {
dt[, .(dt[, 1], Freq = Prop * 1000)]
},
"dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1]" = {
dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1]
},
"dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -\"Prop\"]" = {
dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -"Prop"]
},
"dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)]" = {
dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)]
},
"transform(dt, Freq = Prop * 1000, Prop = NULL)" = {
transform(dt, Freq = Prop * 1000, Prop = NULL)
},
replications = 1000,
columns = c("test", "replications", "elapsed", "relative")
)
# test replications elapsed relative
# 1 dt[, .(dt[, 1], Freq = Prop * 1000)] 1000 18.66 1.112
# 3 dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -"Prop"] 1000 17.02 1.014
# 2 dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1] 1000 16.78 1.000
# 4 dt[, transform(.SD, Freq = Prop * 1000, Prop = NULL)] 1000 333.51 19.875
# 5 transform(dt, Freq = Prop * 1000, Prop = NULL) 1000 329.41 19.631
Sidenote
Keep in mind that creating the column by reference is like a 5-fold faster
dt[, Freq := Prop * 1000]
and OP uses the argument the table is re-used later. I would suggest always to do all calculations and preparations by reference on the table when it gains in speed. You can always subset your output from there.
# test replications elapsed relative
# 1 dt[, .(dt[, 1], Freq = Prop * 1000)] 1000 16.25 5.783
# 2 dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = 1] 1000 13.33 4.744
# 3 t[, Freq := Prop * 1000] 1000 2.81 1.000
Upvotes: 1
Reputation: 887911
We can use the data.table
syntax to get the output format
dt[, c(.SD, .(Freq = Prop * 1000)), .SDcols = -"Prop"]
-output
gender Freq
1: Male 490
2: Female 510
Upvotes: 3