Reputation: 57
I need help speeding up a small piece of code. I have a data.frame "df" and would like to create new columns and fill them with given values. Here is some sample code showing how I do it.
df <- as.data.frame(1:20)
a <- c(31:50)
b <- c(201:220)
df[c("A","B")] <- c(a, b)
Now the problem is that my data is big (several million rows) and this takes more time than expected, so I think there must be a better way. Any ideas? Thank you!
Upvotes: 1
Views: 304
Reputation: 5958
You stated you have 'some million' rows so here is an excerpt of benchmarks with 3 columns of 10 million rows...
R 3.0.3 (on 32bit Celeron system w/ 2GB memory)
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 35.38 56.03 64.82 67.77 185.2 100
## df.add(df) 181.43 214.80 221.42 229.81 366.6 100
## dt.addB(dt) 2359.54 2457.09 2513.11 2577.00 6398.0 100
## dt.addA(dt) 2913.74 2995.64 3047.29 3125.82 6791.1 100
R 3.1.0 (on 64bit Haswell i7 w/ 24 GB memory)
## Unit: microseconds
## expr min lq median uq max neval
## df.add(df) 10.25 30.74 33.36 48.53 84.25 100
## dt.addC(dt) 27120.45 27563.79 27990.22 29642.46 87637.63 100
## dt.addB(dt) 38452.71 39018.90 46225.69 50142.46 130893.53 100
## dt.addA(dt) 193268.78 247749.71 251380.74 256380.43 440916.17 100
Note:
The difference between data.frame and data.table on 3.1.0 can be explained by the new way that R 3.1.0 handles assignments (it avoids deep-copying the whole object when a column is assigned). Arun (one of the data.table authors) explains this in this chat log.
df.add (a common base way to add columns to a data.frame):
df$b <- b.vals
df$c <- c.vals
dt.addA (the common base data.frame method applied to a data.table):
dt$b <- b.vals
dt$c <- c.vals
dt.addB (a common data.table way to add columns):
dt[,`:=`(b=b.vals, c=c.vals)]
dt.addC (another data.table method of setting values [from Arun]):
## to reduce the overhead due to `[.data.table` on small data.frames.
set(dt, j="b", value=b.vals)
set(dt, j="c", value=c.vals)
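For anyone who wants to rerun the comparison, here is a minimal sketch of how the four functions above could be wrapped and timed. The function bodies follow the definitions above; the row count `n`, the contents of `b.vals` and `c.vals`, and the `times` setting are assumptions for illustration, not the values used for the published numbers.

```r
# Sketch only: sizes and values below are assumed, not taken from the answer.
library(data.table)
library(microbenchmark)

n <- 1e5L                      # hypothetical size; the answer goes up to 1e7
b.vals <- seq_len(n) + 30L
c.vals <- seq_len(n) + 200L

df <- data.frame(a = seq_len(n))
dt <- data.table(a = seq_len(n))

# Base `$<-` on a data.frame (works on a local copy, returns it)
df.add <- function(df) { df$b <- b.vals; df$c <- c.vals; df }

# Base `$<-` applied to a data.table
dt.addA <- function(dt) { dt$b <- b.vals; dt$c <- c.vals; dt }

# data.table `:=` (modifies the caller's table by reference)
dt.addB <- function(dt) dt[, `:=`(b = b.vals, c = c.vals)]

# data.table set() (also by reference, skips `[.data.table` overhead)
dt.addC <- function(dt) {
  set(dt, j = "b", value = b.vals)
  set(dt, j = "c", value = c.vals)
  dt
}

res <- microbenchmark(df.add(df), dt.addA(dt), dt.addB(dt), dt.addC(dt),
                      times = 20L)
print(res)
```

One thing to keep in mind when reading such timings: `dt.addB` and `dt.addC` change `dt` itself, so after their first run the later iterations overwrite existing columns rather than add new ones, while `df.add` always starts from a `df` without `b` and `c`.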
Benchmarks for other data set sizes
R 3.1.0 on i7 System
# Test @ 1,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 6.007 10.38 11.71 12.50 20.79 100
## df.add(df) 11.534 19.49 20.57 21.32 940.63 100
## dt.addB(dt) 326.166 344.85 351.43 365.47 1412.86 100
## dt.addA(dt) 798.777 850.47 867.60 888.23 1935.20 100
## test relative
## 1 df.add(df) 1
## 4 dt.addC(dt) 1
## 3 dt.addB(dt) 35
## 2 dt.addA(dt) 87
# Test @ 10,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 11.13 17.88 19.20 20.80 988.9 100
## df.add(df) 10.97 20.56 22.65 24.94 41.1 100
## dt.addB(dt) 333.17 364.15 389.87 419.08 1347.0 100
## dt.addA(dt) 823.99 875.88 897.10 1076.90 29233.1 100
## test relative
## 1 df.add(df) 1
## 4 dt.addC(dt) 1
## 3 dt.addB(dt) 19
## 2 dt.addA(dt) 50
# Test @ 10,000,000
## Unit: microseconds
## expr min lq median uq max neval
## df.add(df) 10.25 30.74 33.36 48.53 84.25 100
## dt.addC(dt) 27120.45 27563.79 27990.22 29642.46 87637.63 100
## dt.addB(dt) 38452.71 39018.90 46225.69 50142.46 130893.53 100
## dt.addA(dt) 193268.78 247749.71 251380.74 256380.43 440916.17 100
## test relative
## 1 df.add(df) 1
## 4 dt.addC(dt) 1536
## 3 dt.addB(dt) 2213
## 2 dt.addA(dt) 11667
R 3.0.3 on Celeron System
# Test @ 1,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 55.78 82.58 94.48 96.14 176.1 100
## df.add(df) 182.65 215.36 220.10 225.03 361.6 100
## dt.addB(dt) 2699.10 2774.61 2827.34 2894.23 3442.2 100
## dt.addA(dt) 5259.89 6066.00 6122.37 6231.50 10265.9 100
## test relative
## 4 dt.addC(dt) 1.000
## 1 df.add(df) 2.889
## 3 dt.addB(dt) 32.444
## 6 dfadd2dtB(dt) 69.667
## 2 dt.addA(dt) 69.889
## 5 dfadd2dtA(dt) 96.000
# Test @ 10,000
## Unit: microseconds
## expr min lq median uq max neval
## dt.addC(dt) 134.0 162.8 168.7 185.8 4135 100
## df.add(df) 576.7 616.4 633.7 663.2 72749 100
## dt.addB(dt) 2789.8 2932.6 2993.0 3054.7 6702 100
## dt.addA(dt) 5400.6 6701.5 6819.0 10079.2 11518 100
## test relative
## 4 dt.addC(dt) 1.000
## 1 df.add(df) 8.143
## 3 dt.addB(dt) 14.619
## 2 dt.addA(dt) 34.286
## 6 dfadd2dtB(dt) 34.381
## 5 dfadd2dtA(dt) 53.810
# Test @ 10,000,000
## Unit: milliseconds
## expr min lq median uq max neval
## dt.addC(dt) 121.1 146.2 147.2 161.8 303.8 100
## dt.addB(dt) 197.7 225.4 228.0 270.2 380.7 100
## df.add(df) 767.8 823.5 857.0 938.2 1156.9 100
## dt.addA(dt) 709.6 1071.9 1112.6 1170.1 1343.9 100
## test relative
## 4 dt.addC(dt) 1.000
## 3 dt.addB(dt) 1.566
## 1 df.add(df) 6.172
## 2 dt.addA(dt) 7.594
System/Session Info...
Intel® Core™ i7-4700MQ Processor
24 GB
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rbenchmark_1.0.0 microbenchmark_1.3-0 data.table_1.9.2
##
## "Linux" "3.11.0-19-generic" "x86_64"
Intel(R) Celeron(R) CPU 2.53GHz
2 GB
## R version 3.0.3 (2014-03-06)
## Platform: i686-pc-linux-gnu (32-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rbenchmark_1.0.0 microbenchmark_1.3-0 data.table_1.9.2
## [4] knitr_1.5
##
## "Linux" "3.2.0-60-generic-pae" "i686"
Upvotes: 3
Reputation: 3224
Extending a data.frame (or any object) causes R to copy the whole object when you add a new column. The data.table package offers some great performance features built on top of the data.frame model. Among other things, it lets you add columns in place. See the code below for a simple demo:
require(data.table)
a2 <- data.table(x=1:10)
a2[, y:=21:30] ## this will create y inside a2 without copying it
summary(a2) ## just like using a data.frame
The resulting object (a data.table) will play nicely with (almost) all code that makes use of data.frame. It has an alternative syntax for most operations, which are performed much more efficiently. It's worth spending some time looking into.
If you'd like to add multiple columns, then:
a2[, `:=`(y=21:30, z=31:40)]
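To make the "in place" claim concrete, the sketch below checks that the table's memory address does not change after `:=`, using `address()` from data.table. The column names `w` and `v` and the `alias`/`indep` variables are made up for this illustration.

```r
library(data.table)

a2 <- data.table(x = 1:10)
addr_before <- address(a2)     # memory address of the data.table object

a2[, y := 21:30]                           # add one column by reference
a2[, `:=`(z = 31:40, w = letters[1:10])]   # add several columns at once

# Same address: the table was modified in place, not replaced by a copy.
same_object <- identical(addr_before, address(a2))

# Caveat: `<-` does not copy a data.table, so `alias` is the same object
# as `a2`; use copy() when an independent table is needed.
alias <- a2
indep <- copy(a2)
a2[, v := 0L]

print(same_object)
print(names(alias))   # includes "v"
print(names(indep))   # does not include "v"
```

The flip side of by-reference updates is the aliasing shown at the end, so it is worth calling `copy()` before modifying a table that other code still relies on.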
Edit: @Thell has taken the time to prepare benchmarks of different methods for extending a data.frame. They suggest that, despite the copying, data.frame can be faster (notably on R 3.1.0). Keep this in mind as an alternative and see which one works best for your code.
Upvotes: 5
Reputation: 234
Why don't you simply do the following:
df <- data.frame(x = 1:20)
df$a <- 31:50
df$b <- 201:220
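If you would rather add both columns in one statement, as in the question, a small base-R sketch is to assign a list so that each vector maps to its column explicitly. The `df2` variant with `cbind()` is just an alternative shown for comparison.

```r
df <- data.frame(x = 1:20)
a <- 31:50
b <- 201:220

# Each list element becomes one column, matched by position to the names.
df[c("A", "B")] <- list(a, b)

# Alternative: build a new data.frame in one step with cbind().
df2 <- cbind(data.frame(x = 1:20), A = a, B = b)

str(df)
```

Passing a list avoids relying on `c(a, b)` being split back into two columns of the right length.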
There's an excellent ebook called "R Fundamentals and Graphics" which will give you a solid understanding of the basics of R and its graphical features.
Upvotes: 1