gavenkoa
gavenkoa

Reputation: 48933

Combine columns on factor

I have data with factors lang and alg and I like to compare difference for selected lang pair between all alg:

> perf[perf$lang == "java", c("alg", "cpu")]
               alg   cpu
173    binarytrees 0.196
174 chameneosredux 0.404
175  fannkuchredux 0.648

> perf[perf$lang == "python3", c("alg", "cpu")]
               alg    cpu
246    binarytrees  0.972
248  fannkuchredux 13.752
249          fasta  1.152

For binarytrees I expect to get 0.196/0.972, but for chameneosredux is NA, for fannkuchredux is 0.648/13.752, for fasta is NA, ...

One way is to sort rows on alg but I don't understand how to inject rows with NA on missing factors (all factors available in unique(perf$alg)).

UPDATE Despite original question I think that I like to combine columns of two data frames into single data frame on same factor:

   binarytrees 0.196   0.972
chameneosredux 0.404      NA
         fasta    NA   1.152
 fannkuchredux 0.648  13.752

Upvotes: 0

Views: 52

Answers (1)

nrussell
nrussell

Reputation: 18612

What you are looking for is essentially a FULL OUTER JOIN in SQL, which can be done with base::merge using all = TRUE.

Here is a comparable data set to demonstrate with:

Df <- data.frame(
    Lang = rep(LETTERS[1:5], rep(3, 5)),
    Alg = c(replicate(5, sample(letters[1:4], 3))),
    Cpu = rnorm(15),
    stringsAsFactors = FALSE
)

Note that I'm using stringsAsFactors = FALSE. I would suggest you convert your columns to character vectors as well; I don't see any need for using factors here.


This is the merge operation in a light wrapper function, just to make the presentation a little cleaner:

compare <- function(x, y, data) {
    merge(x = data[data$Lang == x[1], 2:3],
          y = data[data$Lang == y[1], 2:3],
          by = "Alg", all = TRUE,
          suffixes = c(paste0(".", x[1]), 
                       paste0(".", y[1]))
    )
}

And here it is in use:

compare("A", "D", Df)
#  Alg      Cpu.A       Cpu.D
#1   a         NA -0.06520117
#2   b  1.0587151  0.08379303
#3   c -2.0390119          NA
#4   d -0.8574474  1.27865596

compare("A", "C", Df)
#  Alg      Cpu.A      Cpu.C
#1   b  1.0587151 -1.0230431
#2   c -2.0390119 -0.7691048
#3   d -0.8574474 -1.2421078

Regarding my comment, this can also be achieved using sqldf. SQLite does not support FULL OUTER JOIN, but this shouldn't be too much of an issue if you are comfortable with SQL, as there are probably a dozen or so ways to work around that:

library(sqldf)

sqldf(
    "select x.Alg 
        ,lhs.Cpu as 'Cpu.A'
        ,rhs.Cpu as 'Cpu.D'
     from (
        select distinct d.Alg
        from Df d
     ) x
     left join Df lhs on lhs.Alg = x.Alg and lhs.Lang = 'A'
     left join Df rhs on rhs.Alg = x.Alg and rhs.Lang = 'D'
     order by x.Alg"
)

#  Alg      Cpu.A       Cpu.D
#1   a         NA -0.06520117
#2   b  1.0587151  0.08379303
#3   c -2.0390119          NA
#4   d -0.8574474  1.27865596

Upvotes: 1

Related Questions