Reputation: 31
I want to use apply instead of a for loop to speed up a function that creates a character vector by paste-collapsing each row of a data frame, which contains strings and numbers with many decimals.
The speed-up is notable, but apply pads the numbers with spaces on the left so that all values have the same number of characters, and rounds them to integers, whereas the for loop does not.
I was able to work around this by applying as.character to the numbers, but the data frame then uses much more memory, and I still don't know why apply does this. Does anyone have an explanation or a better solution?
Using apply:
df <- data.frame(V1=rep(letters[1:20], 1000/20), V2=(1:1000)+0.00000001,
                 V3=rep(letters[1:20], 1000/20), stringsAsFactors=FALSE)
system.time(varapl <- apply(df, 1, function(x){
  paste(x[1:3], collapse="_")
}))
varapl[c(1,10,100,1000)]
Output:
user system elapsed
0.01 0.00 0.02
[1] "a_   1_a" "j_  10_j" "t_ 100_t" "t_1000_t"
# Padded with spaces on the left and rounded!
Using for:
varfor <- NULL
system.time(for(i in 1:1000){
varfor <- c(varfor, paste(df[i,1:3], collapse="_"))
})
varfor[c(1,10,100,1000)]
Output:
user system elapsed
0.19 0.00 0.19
[1] "a_1.00000001_a" "j_10.00000001_j" "t_100.00000001_t" "t_1000.00000001_t"
# This is what I'm looking for!
The workaround was:
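df2 <- data.frame(V1=rep(letters[1:20], 1000/20),
                  V2=as.character((1:1000)+0.00000001),
                  V3=rep(letters[1:20], 1000/20), stringsAsFactors=FALSE)
# Re-run apply on df2, where V2 is already a character column
varapl <- apply(df2, 1, function(x) paste(x[1:3], collapse="_"))
varapl[c(1,10,100,1000)]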
[1] "a_1.00000001_a" "j_10.00000001_j" "t_100.00000001_t" "t_1000.00000001_t"
However:
object.size(df)
26816 bytes
object.size(df2)
97208 bytes
My original data frames have millions of entries, so both speed and memory constraints are important.
Thank you in advance for your comments! Keo.
Upvotes: 3
Views: 90
Reputation: 31
@alexis_laz answered the question (Thanks!) by linking to this. I'm posting it here since it was mentioned in the comments section.
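The linked explanation is not reproduced here, but the effect is easy to check directly. As a minimal sketch, not necessarily what the link says: apply converts a data frame with as.matrix before looping, and when the columns have mixed types the numeric ones are turned into strings with format(), which pads every value to a common width and uses 7 significant digits by default. The variable name m below is only for illustration.

```r
df <- data.frame(V1 = rep(letters[1:20], 1000/20),
                 V2 = (1:1000) + 0.00000001,
                 V3 = rep(letters[1:20], 1000/20),
                 stringsAsFactors = FALSE)

# apply() works on a matrix, so the data frame is coerced to a character matrix first
m <- as.matrix(df)
m[c(1, 10, 100, 1000), "V2"]   # padded and shortened, as in the apply() output

# The numeric column was formatted as a whole, which is why all widths match
unique(nchar(m[, "V2"]))

# paste() on the numeric values themselves goes through as.character(),
# which keeps up to 15 significant digits
as.character(df$V2[1])
```

This would also explain why the for loop behaves differently: df[i, 1:3] is still a data frame, so paste sees the number itself rather than a pre-formatted string.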
Upvotes: 0
Reputation: 51
I'm not sure what's causing this behavior of apply, but I'd propose an alternative since you're interested in speed. Take a look at Hadley's package tidyr and its function unite.
library(tidyr)
df <- data.frame(V1=rep(letters[1:20], 1000/20), V2=(1:1000)+0.00000001,
V3=rep(letters[1:20], 1000/20), stringsAsFactors=F)
unite(df, var, c(V1, V2, V3))
# var
# 1 a_1.00000001_a
# 2 b_2.00000001_b
# 3 c_3.00000001_c
# 4 d_4.00000001_d
# 5 e_5.00000001_e
# 6 f_6.00000001_f
system.time(varapl <- unite(df, var, c(V1, V2, V3)))
# user system elapsed
# 0 0 0
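If adding a package is not an option, a base R sketch along the same lines is to call paste on the columns directly. paste is vectorised, so no row loop is needed, and the numeric column stays numeric in the data frame, so the extra memory of the as.character workaround is avoided. The name varvec is only for illustration, and this has not been benchmarked against unite on the original data.

```r
df <- data.frame(V1 = rep(letters[1:20], 1000/20),
                 V2 = (1:1000) + 0.00000001,
                 V3 = rep(letters[1:20], 1000/20),
                 stringsAsFactors = FALSE)

# c(df, sep = "_") builds the argument list: one element per column plus sep.
# paste() then joins the columns element by element, with no matrix coercion.
varvec <- do.call(paste, c(df, sep = "_"))
varvec[c(1, 10, 100, 1000)]
```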
Upvotes: 3