Liselotte
Liselotte

Reputation: 382

how to count the number of rows with same values between 2 dataframe in R

I have 2 dataframes

df1 <- data.frame(col11=c("a","a","a","b","b"), col=c(1,2,3,4,5))
df2 <- data.frame(col21=c("c","c","d","d","d"), col=c(1,5,1,2,5))

I want to count index1 as the number of rows in col with the same values between 2 dataframes based to groups of col11 of df1 and col21 of df2, and index2 as the number of unique values in col of both dataframes. Then I want to take the ratio index3 which is calculated by index1/index2 for each couple of groups in col11 and col21. What I did is that I use inner and outer join tables to count index1 and index2 to create these intermediate dataframes

df3 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index1=c(1,2,1,1))
df4 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index2=c(5,6,4,5))

  

to have this resulted dataframe

df5 <- data.frame(group11=("a","a","b","b"), group21=c("c","d","c","d"), index3=c(0.2,0.33,0.25,0.2))

Could you help me to have the resulted dataframe without using join tables and without creating the intermediate dataframes? Thank you.

Upvotes: 1

Views: 268

Answers (2)

jay.sf
jay.sf

Reputation: 72613

You could define two Vectorized functions that do the job.

First we split the col columns of both data frames according to their col** and put them into a list.

L <- c(split(df1$col, df1$col11), split(df2$col, df2$col21))

Define FUN3 to count the lengths of the intercepts, define FUN4 to count the lengths of the "unique" values. (I named the functions according to your dat3 and dat4 interim data frames, since it's the corresponding step).

FUN3 <- Vectorize(function(x, y) length(intersect(x, y)))
FUN4 <- Vectorize(function(x, y) length(c(x, y)))

Use outer which comes from the outer product of matrices. We just need [1:2, 3:4] as subset of the result.

res3 <- as.vector(outer(L, L, FUN3)[1:2, 3:4])
res4 <- as.vector(outer(L, L, FUN4)[1:2, 3:4])

To follow the same logic and get them right, we do similar with the letters from the col** columns, where we use the list numbers 1 to 4.

nm <- do.call(rbind, strsplit(as.vector(outer(1:2, 3:4, paste)), " "))
nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)])

Finally we cbind everything together and setNames.

setNames(cbind.data.frame(nm, res3 / res4), c("group11", "group21", "index3"))
#   group11 group21    index3
# 1       a       c 0.2000000
# 2       b       c 0.2500000
# 3       a       d 0.3333333
# 4       b       d 0.2000000

Edit

outer gives the outer product of the whole matrices. Since we just compare a, b to c, d we just want a part of the resulting matrix. In this example we just want the first "quadrant" i.e. the upper right 4x4 sub-matrix which is rows 1:2 and columns 3:4.

(res3 <- outer(L, L, FUN3))
#   a b c d
# a 3 0 1 2
# b 0 2 1 1
# c 1 1 2 2
# d 2 1 2 3

We may formulate that less hard-coded like so:

(rows <- which(rownames(res3) %in% unique(df1$col11)))  ## i.e. %in% c("a", "b")
# [1] 1 2
(cols <- which(colnames(res3) %in% unique(df2$col21)))  ## i.e. %in% c("c", "d")
# [1] 3 4

(res3 <- as.vector(res3[rows, cols]))
# [1] 1 1 2 1

FUN4 accordingly.

For the names nm we want to subset the names of the L list. To correspond with the data, we need numerical sequences according to the length (i.e. number) of unique strings of the relevant columns of both data frames. Since the numbers of the second one should be consecutive, we just add the number of the first data frame.

lg1 <- seq(length(unique(df1$col11)))
lg2 <- seq(length(unique(df2$col21))) + length(unique(df1$col11))

nm <- do.call(rbind, strsplit(as.vector(outer(lg1, lg2, paste)), " "))
(nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)]))
#     [,1] [,2]
# [1,] "a"  "c" 
# [2,] "b"  "c" 
# [3,] "a"  "d" 
# [4,] "b"  "d" 

Upvotes: 2

Cole
Cole

Reputation: 11255

Here is a approach which unfortunatly still has a lot of joining.

library(data.table)
df1 <- data.frame(col11=c("a","a","a","b","b"), col=c(1,2,3,4,5))
df2 <- data.frame(col21=c("c","c","d","d","d"), col=c(1,5,1,2,5))
setDT(df1)
setDT(df2)

res = CJ(col11 = df1[["col11"]], col21 = df2[["col21"]], unique = TRUE)

res[, index1 := df1[df2, on = .(col)][, .N, keyby = .(col11, col21)]$N]
res[, index2 := mapply(function(x, y) length((c(df1[col11 == x, col], df2[col21 == y, col]))), col11, col21)]
res[, index3 := index1 / index2][]
#>     col11  col21 index1 index2    index3
#>    <char> <char>  <int>  <int>     <num>
#> 1:      a      c      1      5 0.2000000
#> 2:      a      d      2      6 0.3333333
#> 3:      b      c      1      4 0.2500000
#> 4:      b      d      1      5 0.2000000

We use data.table's reference semantics to directly update the data.table within the call so we have no additional objects.

  1. The CJ(...) is to set up all the unique combinations.
  2. The index1 := df1[df2, ...] is join syntax followed by determining the count (.N) of each combination. Note, I believe it is safe to not join this back to res because the keyby will result in the same order as what was made in the CJ.
  3. The mapply(...) call is a fancy loop where we filter for each row in the res for each combination. I will make changes depending on feedback on whether the col is unique or not.

Finally, it is worth pointing out that there is no simple solution for this. There are going to be intermediate calculation steps to prevent these calls from going too long.

Upvotes: 1

Related Questions