bright-star

Reputation: 6427

Is there a way to accelerate matrix plots?

ggpairs(), like its grandparent scatterplotMatrix(), is terribly slow as the number of variables grows. That's fair; the number of panels grows quadratically with the number of variables.
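
To put numbers on the scale of the problem, here is a quick back-of-the-envelope check using the sizes from the sample data in this post. The points estimate assumes every off-diagonal panel is a scatterplot of all rows, which may not match what ggpairs() actually draws for every panel type.

```r
# Panel counts for a pairs matrix, using the sizes from the sample data
num.vars <- 100
num.rows <- 50000

panels       <- num.vars^2            # every (row, column) cell, diagonal included: 10000
unique.pairs <- choose(num.vars, 2)   # distinct unordered pairs of variables: 4950

# Assumption: each off-diagonal panel plots every row once.
# Total points drawn under that assumption: 495,000,000
points.drawn <- (panels - num.vars) * num.rows

print(c(panels = panels, unique.pairs = unique.pairs, points.drawn = points.drawn))
```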

What isn't fair is that I have to watch the other cores on my machine sit idle while one cranks away at 100% load.

Is there a way to parallelize large matrix plots?

Here is some sample data for benchmarking.

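library(GGally)
library(data.table)

num.vars <- 100
num.rows <- 50000

# num.vars uniform random columns plus a two-level factor used for colouring
tmp <- data.table(replicate(num.vars, runif(num.rows)),
                  class = as.factor(sample(0:1, size = num.rows, replace = TRUE)))

system.time({
    # Build the plot matrix object
    tmp.plot <- ggpairs(data = tmp, diag = list(continuous = "density"),
                        columns = 1:num.vars, colour = "class", axisLabels = "show")
    # Render it on the current graphics device
    print(tmp.plot)
})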

Interestingly enough, my initial benchmarks without the print() call ran at tolerable speeds (21 minutes for the above). Adding the print() call caused what appear to be segfaults on my machine (hard to say at the moment, because the R session is simply killed by the OS).

Is the problem in memory, or is this something that could be parallelized? (At least the plot generation part seems amenable to parallelization.)

Upvotes: 0

Views: 315

Answers (1)

Richie Cotton

Reputation: 121057

Drawing ggpairs plots is single-threaded because the bulk of the work inside GGally:::print.ggpairs happens inside two nested for loops (somewhere around line 50, depending upon how you count lines):

for (rowPos in 1:numCol) {
    for (columnPos in 1:numCol) {

It may be possible to replace these with calls to plyr::l_ply (or similar), which has a .parallel argument. I have no idea whether the graphics devices will cope with several cores trying to draw on them simultaneously, though. My gut feeling is that getting parallel plotting to work robustly may be non-trivial, but it could also be a fun project.
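
One way to sidestep the shared-device worry is to give every worker its own device and write each panel to a separate file. The following is only a rough sketch of that idea, not how GGally works internally. It swaps in the base parallel package (mclapply) rather than plyr, uses base graphics instead of ggplot2 to stay self-contained, and draw_panel, the file naming, and the small data sizes are all hypothetical choices made for illustration. Stitching the files back into one matrix is left out.

```r
library(parallel)

# Small synthetic data so the sketch runs quickly (sizes are illustrative)
set.seed(1)
num.vars <- 4
num.rows <- 200
dat <- as.data.frame(replicate(num.vars, runif(num.rows)))
dat$class <- factor(sample(0:1, size = num.rows, replace = TRUE))

# One row per panel: every (rowPos, columnPos) cell of the plot matrix
panels <- expand.grid(rowPos = seq_len(num.vars), columnPos = seq_len(num.vars))

out.dir <- tempfile("panels_")
dir.create(out.dir)

# Hypothetical helper: draw panel i on its own PDF device and return the file path
draw_panel <- function(i) {
  r <- panels$rowPos[i]
  k <- panels$columnPos[i]
  path <- file.path(out.dir, sprintf("panel_%03d_%03d.pdf", r, k))
  grDevices::pdf(path, width = 3, height = 3)
  on.exit(grDevices::dev.off())   # close this worker's device even on error
  if (r == k) {
    # Diagonal: density of a single variable
    graphics::plot(stats::density(dat[[r]]), main = names(dat)[r])
  } else {
    # Off-diagonal: scatterplot coloured by class
    graphics::plot(dat[[k]], dat[[r]], col = as.integer(dat$class) + 1L,
                   pch = 20, xlab = names(dat)[k], ylab = names(dat)[r])
  }
  path
}

# mclapply forks, which is unavailable on Windows, so fall back to one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L
files <- unlist(mclapply(seq_len(nrow(panels)), draw_panel, mc.cores = cores))
```

Whether this helps in practice depends on how much of the time goes into rendering rather than building the plot objects, and on whether memory is the real bottleneck, so it would need benchmarking on the real data.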

Upvotes: 4
