Maxwell Chandler
Maxwell Chandler

Reputation: 642

for every row in one data frame find the most similar row in another

I have two data frames with Boolean values and numeric values. If needed the numeric values could be put into categories.

var1 <- c(400,234,199,45,77,19)
var2 <- c(0,0,1,1,0,1)
var3 <- c(1,0,1,0,0,1) 
df1 <- data.frame(var1,var2,var3)
var.1 <- c(78,147,670,200,75,17)
var.2 <- c(0,0,0,1,1,1)
var.3 <- c(0,1,1,0,1,1) 
df2 <- data.frame(var.1,var.2,var.3)

I want to find in df1 the most similar row in df2.

I am aware of cluster analysis, which I could do for one data frame by itself, but once I have clusters for one data frame, how would I extract and apply the same clustering algorithm to the other data frame, so that both data frames are clustered in the same way? I also need as many "clusters" as rows in the data frame, which makes me think cluster analysis is not for this task.

Additionally, every row in df1 must be matched with only one row in df2 so that at the end of the process every row in df1 matches to a different row in df2. This is tricky, because if taking each row in isolation in df1, the same row in df2 might get matched multiple times, which is not desired.

Upvotes: 1

Views: 2603

Answers (1)

Bernhard
Bernhard

Reputation: 4417

You don'T have to do clustering, just search for the smallest distance. Take the first row of df1 and cbind it with df2. This is easiest, if column names are identical

var1 <- c(400,234,199,45,77,19)
var2 <- c(0,0,1,1,0,1)
var3 <- c(1,0,1,0,0,1) 
df1 <- data.frame(var.1,var.2,var.3)
var.1 <- c(78,147,670,200,75,17)
var.2 <- c(0,0,0,1,1,1)
var.3 <- c(0,1,1,0,1,1) 
df2 <- data.frame(var.1,var.2,var.3)

rbind(df1[1,], df2)

the result of this can be examinated with dist. We are only interested in the first column of the result, i. e. the first nrow(df2) results.

dist(rbind(df1[1,], df2))[1:nrow(df2)]

evaluates to

> dist(rbind(df1[1,], df2))[1:nrow(df2)]
[1]   0.000000  69.007246 592.000845 122.004098   3.316625
[6]  61.016391

and which.min tells us, which of the rows has the smallest distance:

> which.min(dist(rbind(df1[1,], df2))[1:nrow(df2)])
[1] 1

So the first line in df2 has the smallest distance to the first line of df1. You can put that into an apply or a for loop to do the calculation for each row in df1.

You have to answer the question though, how the distance of a mixture of Boolean and numeric values should be computed. There is no universal answer for that.

Upvotes: 1

Related Questions