sds

Reputation: 60014

do.call/rbind slower on data.frame/data.table than on matrix?

I have a long file which I read into a list using readLines/strsplit.
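
Roughly, that step looks like this (the file name and the whitespace split pattern are my assumptions, not from the original post):

lines <- readLines("edges.txt")                 ## one record per line
edges.split <- strsplit(lines, "[[:space:]]+")  ## split each line into tokens

The resulting list looks like this: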

> head(edges.split)
[[1]]
 [1] "1"       "1263895" "4415645" "1798592" "576013"  "1315720" "1179526"
 [8] "4257735" "4368477" "4045891" "336813"  "4257736" "1179526" "3494186"
[15] "4257735" "4257735"

[[2]]
 [1] "2"       "4831424" "2070750" "3"       "798464"  "1208032" "351213" 
 [8] "2816552" "1484206" "4493159" "5"       "1"       "4"       "4493043"
[15] "3126743" "1207504" "1499874" "214487"  "173486"  "1484207"

[[3]]
 [1] "3"       "2"       "4"       "3648046" "1872711" "1275714" "702512" 
 [8] "1275655" "1667650" "1484207"

[[4]]
 [1] "4"       "4463893" "3618982" "3624614" "3299496" "4348657" "4104419"
 [8] "3070955" "2707725" "5"       "4463739" "4158900" "1135360" "653364" 
[15] "806185"  "2465873" "3299496" "3060623" "1965801" "1005013" "3070955"
[22] "3103098" "4283482" "1951317" "1487656" "4632995" "4402849" "2707725"
[29] "1564441" "576420"  "1972753" "1740415" "3070390" "2391329" "3827055"
[36] "996590"  "4267592" "3787645" "1857269" "4348657" "3491190" "3787645"
[43] "3149658" "3159019" "3787645" "1135358" "2183685" "2303714" "3159019"
[50] "2465873" "4276571" "4446386" "2854060" "3299496" "1740415" "4402849"
[57] "4632995" "3494237" "2050300" "1135358" "3787645"

[[5]]
 [1] "5"       "336813"  "4"       "3159019" "2303714" "1740415" "4"      
 [8] "305277"  "2707725" "2303714" "1740415" "3494237" "1135358" "4"      

[[6]]
 [1] "6"       "499620"  "3622792" "1315540" "576013"  "1798592" "3965874"
 [8] "752451"  "1017219" "1762253" "3693356" "348788"  "4038359" "336813" 
[15] "3449680" "4717601" "3545052" "4494041" "748702"  "1093005" "3143747"
[22] "1648572" "1093005" "1648572" "3143747"

Now I want to convert it to a 3-column data.frame/data.table:

library(data.table)

edges.df <- do.call(rbind, lapply(edges.split, function (l)
  if (length(l) <= 1) NULL          # skip records with no destinations
  else {
    tab <- table(tail(l, -1))       # count how often each destination repeats
    data.table(src    = as.integer(l[1]),
               dst    = as.integer(names(tab)),
               weight = as.numeric(tab))
  }))
str(edges.df)
Classes 'data.table' and 'data.frame':  116330611 obs. of  3 variables:
 $ src   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ dst   : int  1179526 1263895 1315720 1798592 336813 3494186 4045891 4257735 4257736 4368477 ...
 $ weight: num  2 1 1 1 1 1 1 3 1 1 ...

This takes 5.5 hours and consumes 20 GB of RAM (the data.frame version is still running - 15 hours and counting).

The simpler matrix version

edges.df <- do.call(rbind,lapply(edges.split,function (l)
  cbind(as.integer(l[1]),as.integer(tail(l,-1)))))

finished in under 10 minutes, producing a 156716688x2 matrix.

Is the huge time difference due to the table calls? How can I speed this up?

Upvotes: 2

Views: 180

Answers (2)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

If I understand your problem correctly, I would piece the separate parts together first and then tabulate, making use of vectorized tools like rep and data.table's .N to improve performance.

Without reproducible data, I would suggest trying something like:

## Extract just the first values of each list element
Nam <- vapply(edges.split, function(x) x[1], character(1L))

## How long is each list element (minus the first element)?
Len <- vapply(edges.split, length, numeric(1L)) - 1

## Put the pieces together and use `.N` to aggregate
data.table(src = rep(Nam, Len), 
           dst = unlist(lapply(edges.split, 
                               function(x) x[-1])))[
                                 , list(weight = .N), by = .(src, dst)]
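
The key step is rep() with a vector of times, which repeats each src id once per destination (this small illustration is mine, not from the original answer):

rep(c("1", "2"), times = c(3, 2))
# [1] "1" "1" "1" "2" "2"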

However, it should be noted that you would need to make "Nam" unique in order to match the output from your approach.
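
For example (my illustration), make.unique() disambiguates repeated ids by appending suffixes:

make.unique(c("1", "2", "1"))
# [1] "1"   "2"   "1.1"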


Here are some benchmarks. David's function doesn't produce exactly the same output as the OP's approach, but I think it can easily be modified to do so (I just don't have the time to experiment right now).

First, the functions:

opFun <- function() {
  do.call(rbind,lapply(edges.split,function (l)
    if (length(l) <= 1) NULL
    else {
      tab <- table(tail(l,-1))
      data.table(src=as.integer(l[1]),
                 dst=as.integer(names(tab)),
                 weight=as.numeric(tab))
    }))
} 


myFun <- function() {
    Nam <- vapply(edges.split, function(x) x[1], character(1L))
    Nam <- make.unique(Nam)
    Len <- vapply(edges.split, length, numeric(1L)) - 1

    data.table(src = rep(Nam, Len), 
               dst = unlist(lapply(edges.split, 
                                   function(x) x[-1])))[
                                     , list(weight = .N), by = .(src, dst)]
}

da <- function() {
  setDT(unnest(edges.split, "src"))[
    , .(weight = .N), keyby = .(src, dst = x)]
}

Second, a way to make some sample data:

data.maker <- function(size) {
  set.seed(1)
  lapply(seq_len(size), function(x) {
    ## each element: an id followed by a random number (up to 20) of values
    ## sampled with replacement from 1:100
    as.character(c(x, sample(100, sample(20), TRUE)))
  })
}
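
For example, a quick sanity check (my own snippet, not from the answer):

tmp <- data.maker(5)
length(tmp)     ## 5 list elements
lengths(tmp)    ## each element holds an id plus its sampled values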

Third, the timing:

library(data.table)
library(tidyr)           # provides unnest(), used by da()
library(microbenchmark)

## 100 list items
edges.split <- data.maker(100)
microbenchmark(opFun(), myFun(), da(), times = 10)
# Unit: milliseconds
#    expr        min         lq       mean     median        uq       max neval
# opFun() 227.980049 231.180087 235.767195 238.358194 239.68957 240.84357    10
# myFun()   6.276912   6.372855   7.015674   6.700846   6.76109  10.79427    10
#    da()   9.984779  10.152121  10.419066  10.350701  10.73314  11.01650    10

## 100k list items
edges.split <- data.maker(100000)
system.time(da())
# user  system elapsed 
# 9.52    0.11    9.64 
system.time(myFun())
# user  system elapsed 
# 3.03    0.08    3.14 

## 1M list items
edges.split <- data.maker(1000000)
system.time(da())
#    user  system elapsed 
#  129.53    2.22  132.51 
system.time(myFun())
#    user  system elapsed 
#   31.30    0.71   32.14 

Upvotes: 2

David Arenburg

Reputation: 92292

I think calling operations such as data.table, as.integer (twice), as.numeric and table in every single iteration is just the wrong approach. I would recommend first using unnest from tidyr to create your data set and only then getting data.table involved. I don't have your real data, but I'd bet this will be faster:

library(tidyr)
library(data.table)
edges.df <- setDT(unnest(edges.split, "src"))[, 
                  .(weight = .N), 
                  keyby = .(src, dst = x)]

Output

head(edges.df)
#    src     dst weight
# 1:  X1       1      1
# 2:  X1 1179526      2
# 3:  X1 1263895      1
# 4:  X1 1315720      1
# 5:  X1 1798592      1
# 6:  X1  336813      1

Upvotes: 2
