Keo

Reputation: 31

Unexpected behavior of apply v. for loop in R

I want to use apply instead of a for loop to speed up a function that builds a character vector by paste-collapsing each row of a data frame. The data frame contains strings and numbers with many decimal places.
The speed-up is noticeable, but apply pads the numbers on the left with spaces so that all values have the same number of characters, and it rounds them to integers. The for loop does neither.
I was able to work around this by converting the numbers with as.character, but the data frame then uses much more memory, and I still don't know why apply does this. Does anyone have an explanation or a better solution?

Using apply:

df <- data.frame(V1=rep(letters[1:20], 1000/20), V2=(1:1000)+0.00000001,
                 V3=rep(letters[1:20], 1000/20), stringsAsFactors=FALSE)

system.time(varapl <- apply(df, 1, function(x){
  paste(x[1:3], collapse="_")
}))
varapl[c(1,10,100,1000)]

Output:

  user  system elapsed 
  0.01    0.00    0.02 

[1] "a_   1_a" "j_  10_j" "t_ 100_t" "t_1000_t"
# Padded with spaces on the left and rounded!

Using for:

varfor <- NULL
system.time(for(i in 1:1000){
  varfor <- c(varfor, paste(df[i,1:3], collapse="_"))
})
varfor[c(1,10,100,1000)]

Output:

   user  system elapsed 
   0.19    0.00    0.19 

[1] "a_1.00000001_a"    "j_10.00000001_j"   "t_100.00000001_t"  "t_1000.00000001_t"
# This is what I'm looking for!

The workaround was:

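df2 <- data.frame(V1=rep(letters[1:20], 1000/20),
                  V2=as.character((1:1000)+0.00000001),
                  V3=rep(letters[1:20], 1000/20), stringsAsFactors=FALSE)

# Re-run apply on the all-character data frame
varapl <- apply(df2, 1, function(x) paste(x[1:3], collapse="_"))
varapl[c(1,10,100,1000)]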

[1] "a_1.00000001_a"   "j_10.00000001_j"  "t_100.00000001_t"  "t_1000.00000001_t"

However:

object.size(df)
26816 bytes
object.size(df2)
97208 bytes

My original data frames have millions of entries, so both speed and memory constraints are important.

Thank you in advance for your comments! Keo.

Upvotes: 3

Views: 90

Answers (2)

Keo

Reputation: 31


@alexis_laz answered the question (thanks!) by linking to this. I'm posting it here since it was mentioned in the comments section.
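
For anyone who lands here without the link, here is a minimal sketch of what seems to be going on. It is based on the documented behaviour of apply and as.matrix, not on the linked post itself. apply coerces a data frame with as.matrix before it loops over the rows, and for a data frame with mixed column types the numeric columns are converted with format(), which pads every value to a common width and keeps 7 significant digits by default:

```r
df <- data.frame(V1 = rep(letters[1:20], 1000/20),
                 V2 = (1:1000) + 0.00000001,
                 V3 = rep(letters[1:20], 1000/20),
                 stringsAsFactors = FALSE)

# apply() works on as.matrix(df), which is a character matrix here
m <- as.matrix(df)
m[c(1, 1000), "V2"]
# [1] "   1" "1000"

# The V2 column of that matrix matches format() on the whole numeric column
identical(unname(m[, "V2"]), format(df$V2))

# Converting a single value on its own keeps the decimals
as.character(df$V2[1])
# [1] "1.00000001"
```

The for loop never goes through as.matrix, so each number is converted individually and keeps its decimals.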

Upvotes: 0

Drvi

Reputation: 51

I'm not sure what's causing this behavior of apply, but I'd propose an alternative since you're interested in speed. Take a look at Hadley's tidyr package and its unite function.

library(tidyr)

df <- data.frame(V1=rep(letters[1:20], 1000/20), V2=(1:1000)+0.00000001,
                 V3=rep(letters[1:20], 1000/20), stringsAsFactors=F)

unite(df, var, c(V1, V2, V3))

#              var
# 1 a_1.00000001_a
# 2 b_2.00000001_b
# 3 c_3.00000001_c
# 4 d_4.00000001_d
# 5 e_5.00000001_e
# 6 f_6.00000001_f

system.time(varapl <- unite(df, var, c(V1, V2, V3)))

# user  system elapsed 
#   0       0       0 
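
If an extra package is not an option, a base R sketch in the same spirit is do.call with paste. This is a suggestion under the assumption that your real data looks like the example, and I have not timed it on millions of rows. paste is vectorised over the columns and converts each one with as.character, so the data frame is never turned into a matrix and V2 can stay numeric (no need for the larger all-character data frame):

```r
df <- data.frame(V1 = rep(letters[1:20], 1000/20),
                 V2 = (1:1000) + 0.00000001,
                 V3 = rep(letters[1:20], 1000/20),
                 stringsAsFactors = FALSE)

# c(df, sep = "_") is a list of the three columns plus the sep argument,
# so this is equivalent to paste(df$V1, df$V2, df$V3, sep = "_")
varpaste <- do.call(paste, c(df, sep = "_"))

varpaste[c(1, 10, 100, 1000)]
```

The result is a plain character vector rather than a data frame, which is the same shape as the varfor vector built by the for loop in the question.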

Upvotes: 3
