Reputation: 48933
I have data with factors lang
and alg
and I like to compare difference for selected lang
pair between all alg
:
> perf[perf$lang == "java", c("alg", "cpu")]
alg cpu
173 binarytrees 0.196
174 chameneosredux 0.404
175 fannkuchredux 0.648
> perf[perf$lang == "python3", c("alg", "cpu")]
alg cpu
246 binarytrees 0.972
248 fannkuchredux 13.752
249 fasta 1.152
For binarytrees
I expect to get 0.196/0.972
, but for chameneosredux
is NA
, for fannkuchredux
is 0.648/13.752
, for fasta
is NA
, ...
One way is to sort rows on alg
but I don't understand how to inject rows with NA
on missing factors (all factors available in unique(perf$alg)
).
UPDATE Despite original question I think that I like to combine columns of two data frames into single data frame on same factor:
binarytrees 0.196 0.972
chameneosredux 0.404 NA
fasta NA 1.152
fannkuchredux 0.648 13.752
Upvotes: 0
Views: 52
Reputation: 18612
What you are looking for is essentially a FULL OUTER JOIN
in SQL, which can be done with base::merge
using all = TRUE
.
Here is a comparable data set to demonstrate with:
Df <- data.frame(
Lang = rep(LETTERS[1:5], rep(3, 5)),
Alg = c(replicate(5, sample(letters[1:4], 3))),
Cpu = rnorm(15),
stringsAsFactors = FALSE
)
Note that I'm using stringsAsFactors = FALSE
. I would suggest you convert your columns to character
vectors as well; I don't see any need for using factor
s here.
This is the merge
operation in a light wrapper function, just to make the presentation a little cleaner:
compare <- function(x, y, data) {
merge(x = data[data$Lang == x[1], 2:3],
y = data[data$Lang == y[1], 2:3],
by = "Alg", all = TRUE,
suffixes = c(paste0(".", x[1]),
paste0(".", y[1]))
)
}
And here it is in use:
compare("A", "D", Df)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596
compare("A", "C", Df)
# Alg Cpu.A Cpu.C
#1 b 1.0587151 -1.0230431
#2 c -2.0390119 -0.7691048
#3 d -0.8574474 -1.2421078
Regarding my comment, this can also be achieved using sqldf
. SQLite
does not support FULL OUTER JOIN
, but this shouldn't be too much of an issue if you are comfortable with SQL, as there are probably a dozen or so ways to work around that:
library(sqldf)
sqldf(
"select x.Alg
,lhs.Cpu as 'Cpu.A'
,rhs.Cpu as 'Cpu.D'
from (
select distinct d.Alg
from Df d
) x
left join Df lhs on lhs.Alg = x.Alg and lhs.Lang = 'A'
left join Df rhs on rhs.Alg = x.Alg and rhs.Lang = 'D'
order by x.Alg"
)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596
Upvotes: 1