Reputation: 382
I have 2 dataframes
df1 <- data.frame(col11=c("a","a","a","b","b"), col=c(1,2,3,4,5))
df2 <- data.frame(col21=c("c","c","d","d","d"), col=c(1,5,1,2,5))
I want to count index1
as the number of rows in col
with the same values between 2 dataframes based to groups of col11
of df1
and col21
of df2
, and index2
as the number of unique values in col
of both dataframes. Then I want to take the ratio index3
which is calculated by index1
/index2
for each couple of groups in col11
and col21
.
What I did is that I use inner and outer join tables to count index1
and index2
to create these intermediate dataframes
df3 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index1=c(1,2,1,1))
df4 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index2=c(5,6,4,5))
to have this resulted dataframe
df5 <- data.frame(group11=("a","a","b","b"), group21=c("c","d","c","d"), index3=c(0.2,0.33,0.25,0.2))
Could you help me to have the resulted dataframe without using join tables and without creating the intermediate dataframes? Thank you.
Upvotes: 1
Views: 268
Reputation: 72613
You could define two Vectorize
d functions that do the job.
First we split the col
columns of both data frames according to their col**
and put them into a list.
L <- c(split(df1$col, df1$col11), split(df2$col, df2$col21))
Define FUN3
to count the lengths of the intercepts, define FUN4
to count the lengths of the "unique" values. (I named the functions according to your dat3
and dat4
interim data frames, since it's the corresponding step).
FUN3 <- Vectorize(function(x, y) length(intersect(x, y)))
FUN4 <- Vectorize(function(x, y) length(c(x, y)))
Use outer
which comes from the outer product of matrices. We just need [1:2, 3:4]
as subset of the result.
res3 <- as.vector(outer(L, L, FUN3)[1:2, 3:4])
res4 <- as.vector(outer(L, L, FUN4)[1:2, 3:4])
To follow the same logic and get them right, we do similar with the letters from the col**
columns, where we use the list numbers 1 to 4.
nm <- do.call(rbind, strsplit(as.vector(outer(1:2, 3:4, paste)), " "))
nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)])
Finally we cbind
everything together and setNames
.
setNames(cbind.data.frame(nm, res3 / res4), c("group11", "group21", "index3"))
# group11 group21 index3
# 1 a c 0.2000000
# 2 b c 0.2500000
# 3 a d 0.3333333
# 4 b d 0.2000000
outer
gives the outer product of the whole matrices. Since we just compare a, b
to c, d
we just want a part of the resulting matrix. In this example we just want the first "quadrant" i.e. the upper right 4x4 sub-matrix which is rows 1:2
and columns 3:4
.
(res3 <- outer(L, L, FUN3))
# a b c d
# a 3 0 1 2
# b 0 2 1 1
# c 1 1 2 2
# d 2 1 2 3
We may formulate that less hard-coded like so:
(rows <- which(rownames(res3) %in% unique(df1$col11))) ## i.e. %in% c("a", "b")
# [1] 1 2
(cols <- which(colnames(res3) %in% unique(df2$col21))) ## i.e. %in% c("c", "d")
# [1] 3 4
(res3 <- as.vector(res3[rows, cols]))
# [1] 1 1 2 1
FUN4
accordingly.
For the names nm
we want to subset the names of the L
list. To correspond with the data, we need numerical seq
uences according to the length
(i.e. number) of unique
strings of the relevant columns of both data frames. Since the numbers of the second one should be consecutive, we just add the number of the first data frame.
lg1 <- seq(length(unique(df1$col11)))
lg2 <- seq(length(unique(df2$col21))) + length(unique(df1$col11))
nm <- do.call(rbind, strsplit(as.vector(outer(lg1, lg2, paste)), " "))
(nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)]))
# [,1] [,2]
# [1,] "a" "c"
# [2,] "b" "c"
# [3,] "a" "d"
# [4,] "b" "d"
Upvotes: 2
Reputation: 11255
Here is a data.table approach which unfortunatly still has a lot of joining.
library(data.table)
df1 <- data.frame(col11=c("a","a","a","b","b"), col=c(1,2,3,4,5))
df2 <- data.frame(col21=c("c","c","d","d","d"), col=c(1,5,1,2,5))
setDT(df1)
setDT(df2)
res = CJ(col11 = df1[["col11"]], col21 = df2[["col21"]], unique = TRUE)
res[, index1 := df1[df2, on = .(col)][, .N, keyby = .(col11, col21)]$N]
res[, index2 := mapply(function(x, y) length((c(df1[col11 == x, col], df2[col21 == y, col]))), col11, col21)]
res[, index3 := index1 / index2][]
#> col11 col21 index1 index2 index3
#> <char> <char> <int> <int> <num>
#> 1: a c 1 5 0.2000000
#> 2: a d 2 6 0.3333333
#> 3: b c 1 4 0.2500000
#> 4: b d 1 5 0.2000000
We use data.table
's reference semantics to directly update the data.table within the call so we have no additional objects.
CJ(...)
is to set up all the unique combinations.index1 := df1[df2, ...]
is join syntax followed by determining the count (.N) of each combination. Note, I believe it is safe to not join this back to res
because the keyby
will result in the same order as what was made in the CJ
.mapply(...)
call is a fancy loop where we filter for each row in the res
for each combination. I will make changes depending on feedback on whether the col
is unique or not.Finally, it is worth pointing out that there is no simple solution for this. There are going to be intermediate calculation steps to prevent these calls from going too long.
Upvotes: 1