how to count the number of rows with same values between 2 dataframe in R

Question

I have 2 dataframes

df1 <- data.frame(col11=c("a","a","a","b","b"), col=c(1,2,3,4,5))
df2 <- data.frame(col21=c("c","c","d","d","d"), col=c(1,5,1,2,5))

I want to count index1 as the number of rows in col with the same values between 2 dataframes based to groups of col11 of df1 and col21 of df2, and index2 as the number of unique values in col of both dataframes. Then I want to take the ratio index3 which is calculated by index1/index2 for each couple of groups in col11 and col21. What I did is that I use inner and outer join tables to count index1 and index2 to create these intermediate dataframes

df3 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index1=c(1,2,1,1))
df4 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index2=c(5,6,4,5))

to have this resulted dataframe

df5 <- data.frame(group11=("a","a","b","b"), group21=c("c","d","c","d"), index3=c(0.2,0.33,0.25,0.2))

Could you help me to have the resulted dataframe without using join tables and without creating the intermediate dataframes? Thank you.

jay.sf · Accepted Answer

You could define two Vectorized functions that do the job.

First we split the col columns of both data frames according to their col** and put them into a list.

L <- c(split(df1$col, df1$col11), split(df2$col, df2$col21))

Define FUN3 to count the lengths of the intercepts, define FUN4 to count the lengths of the "unique" values. (I named the functions according to your dat3 and dat4 interim data frames, since it's the corresponding step).

FUN3 <- Vectorize(function(x, y) length(intersect(x, y)))
FUN4 <- Vectorize(function(x, y) length(c(x, y)))

Use outer which comes from the outer product of matrices. We just need [1:2, 3:4] as subset of the result.

res3 <- as.vector(outer(L, L, FUN3)[1:2, 3:4])
res4 <- as.vector(outer(L, L, FUN4)[1:2, 3:4])

To follow the same logic and get them right, we do similar with the letters from the col** columns, where we use the list numbers 1 to 4.

nm <- do.call(rbind, strsplit(as.vector(outer(1:2, 3:4, paste)), " "))
nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)])

Finally we cbind everything together and setNames.

setNames(cbind.data.frame(nm, res3 / res4), c("group11", "group21", "index3"))
#   group11 group21    index3
# 1       a       c 0.2000000
# 2       b       c 0.2500000
# 3       a       d 0.3333333
# 4       b       d 0.2000000

Edit

outer gives the outer product of the whole matrices. Since we just compare a, b to c, d we just want a part of the resulting matrix. In this example we just want the first "quadrant" i.e. the upper right 4x4 sub-matrix which is rows 1:2 and columns 3:4.

(res3 <- outer(L, L, FUN3))
#   a b c d
# a 3 0 1 2
# b 0 2 1 1
# c 1 1 2 2
# d 2 1 2 3

We may formulate that less hard-coded like so:

(rows <- which(rownames(res3) %in% unique(df1$col11)))  ## i.e. %in% c("a", "b")
# [1] 1 2
(cols <- which(colnames(res3) %in% unique(df2$col21)))  ## i.e. %in% c("c", "d")
# [1] 3 4

(res3 <- as.vector(res3[rows, cols]))
# [1] 1 1 2 1

FUN4 accordingly.

For the names nm we want to subset the names of the L list. To correspond with the data, we need numerical sequences according to the length (i.e. number) of unique strings of the relevant columns of both data frames. Since the numbers of the second one should be consecutive, we just add the number of the first data frame.

lg1 <- seq(length(unique(df1$col11)))
lg2 <- seq(length(unique(df2$col21))) + length(unique(df1$col11))

nm <- do.call(rbind, strsplit(as.vector(outer(lg1, lg2, paste)), " "))
(nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)]))
#     [,1] [,2]
# [1,] "a"  "c" 
# [2,] "b"  "c" 
# [3,] "a"  "d" 
# [4,] "b"  "d"

how to count the number of rows with same values between 2 dataframe in R

Answers (2)

Edit

Related Questions