Reputation: 642
I have two data frames with Boolean values and numeric values. If needed the numeric values could be put into categories.
var1 <- c(400,234,199,45,77,19)
var2 <- c(0,0,1,1,0,1)
var3 <- c(1,0,1,0,0,1)
df1 <- data.frame(var1,var2,var3)
var.1 <- c(78,147,670,200,75,17)
var.2 <- c(0,0,0,1,1,1)
var.3 <- c(0,1,1,0,1,1)
df2 <- data.frame(var.1,var.2,var.3)
I want to find in df1
the most similar row in df2
.
I am aware of cluster analysis, which I could do for one data frame by itself, but once I have clusters for one data frame, how would I extract and apply the same clustering algorithm to the other data frame, so that both data frames are clustered in the same way? I also need as many "clusters" as rows in the data frame, which makes me think cluster analysis is not for this task.
Additionally, every row in df1
must be matched with only one row in df2
so that at the end of the process every row in df1
matches to a different row in df2
. This is tricky, because if taking each row in isolation in df1
, the same row in df2
might get matched multiple times, which is not desired.
Upvotes: 1
Views: 2603
Reputation: 4417
You don'T have to do clustering, just search for the smallest distance. Take the first row of df1
and cbind
it with df2
. This is easiest, if column names are identical
var1 <- c(400,234,199,45,77,19)
var2 <- c(0,0,1,1,0,1)
var3 <- c(1,0,1,0,0,1)
df1 <- data.frame(var.1,var.2,var.3)
var.1 <- c(78,147,670,200,75,17)
var.2 <- c(0,0,0,1,1,1)
var.3 <- c(0,1,1,0,1,1)
df2 <- data.frame(var.1,var.2,var.3)
rbind(df1[1,], df2)
the result of this can be examinated with dist
. We are only interested in the first column of the result, i. e. the first nrow(df2)
results.
dist(rbind(df1[1,], df2))[1:nrow(df2)]
evaluates to
> dist(rbind(df1[1,], df2))[1:nrow(df2)]
[1] 0.000000 69.007246 592.000845 122.004098 3.316625
[6] 61.016391
and which.min
tells us, which of the rows has the smallest distance:
> which.min(dist(rbind(df1[1,], df2))[1:nrow(df2)])
[1] 1
So the first line in df2
has the smallest distance to the first line of df1
. You can put that into an apply
or a for
loop to do the calculation for each row in df1
.
You have to answer the question though, how the distance of a mixture of Boolean and numeric values should be computed. There is no universal answer for that.
Upvotes: 1