Reputation: 4226
I have the following table, original_table
that results from comparing the frequency of pairs of numbers across the same indices of vector_1
and vector_2
:
vector_1 <- c(5, 6, 5, 4, 6, 6, 4, 1, 6, 7, 5, 3, 3, 4, 4, 7, 7, 7, 2, 7, 2, 6, 1)
vector_2 <- c(1, 2, 1, 3, 4, 4, 4, 2, 4, 7, 2, 5, 5, 3, 3, 6, 7, 7, 6, 3, 6, 7, 2)
original_table <- table(vector_1, vector_2)
str(original_table)
vector_2
vector_1 1 2 3 4 5 6 7
1 0 2 0 0 0 0 0
2 0 0 0 0 0 2 0
3 0 0 0 0 2 0 0
4 0 0 3 1 0 0 0
5 2 1 0 0 0 0 0
6 0 1 0 3 0 0 1
7 0 0 1 0 0 1 3
I'm trying to recode the values of vector_1
to maximize the number of pairs of the same number for values with the same index in vector_2
. I'm ultimately trying to recode these to use double split cross validation described by Breckenridge (2000).
The only "rule" is that each value has to be recoded with a unique value, so both 1
and 2
can't both be recoded as 3
, for example.
I've done this more or less manually this way using car::recode
:
vector_1 <- car::recode(vector_1, "6 = 4; 7 = 7; 4 = 3; 5 = 1; 3 = 5; 2 = 6; 1 = 2")
optimized_table <- table(vector_1, vector_2)
str(optimized_table)
vector_2
vector_1 1 2 3 4 5 6 7
1 2 1 0 0 0 0 0
2 0 2 0 0 0 0 0
3 0 0 3 1 0 0 0
4 0 1 0 3 0 0 1
5 0 0 0 0 2 0 0
6 0 0 0 0 0 2 0
7 0 0 1 0 0 1 3
There are at least a couple of problems with doing it this way: I eyeballed it, so I'm not certain this is the optimal way to maximize the overall number number of pairs between the vectors, and it's not easily reproducible with a different set of data. I'm looking for a way to do this better / more automatically, but I can't easily find a programmatic or smart approach to doing it.
Upvotes: 2
Views: 136
Reputation: 89097
This is known as the assignment problem. One way to solve it is using integer programming; you can use lpSolve::lp.assign
:
library(lpSolve)
res <- lp.assign(-original_table)
l <- apply(res$solution > 0.5, 1, which)
# [1] 2 6 5 3 1 4 7
An a priori faster way to solve the problem is using the Hungarian algorithm, implemented in the clue
package:
library(clue)
res <- solve_LSAP(original_table, maximum = TRUE)
# Optimal assignment:
# 1 => 2, 2 => 6, 3 => 5, 4 => 3, 5 => 1, 6 => 4, 7 => 7
l <- as.integer(res)
# [1] 2 6 5 3 1 4 7
Finally, you can recode using:
vector_1 <- l[vector_1]
Upvotes: 5
Reputation: 215047
If the number of unique values in the two vectors is not very large, we can find it out in a brutal force way by constructing all permutations of the possible recode, looping through the permutations, recoding the vector_1
and calculating the overlap with vector_2
and taking the maximum. This may not scale to different data set either but with a little bit modification should be easily applied to two different vectors:
library(permute)
n = 7 # number of unique values in vector_1 and vector_2
recodes = rbind(1:n, allPerms(n)) # calculate all possible recodes including the identity
which.max(apply(recodes, 1, function(p) sum((1:n)[match(vector_1, p)] == vector_2)))
# [1] 2943
# this line loop through possible permutations and find out the maximum overlap of the two
# vectors after recoding, here we used `match` instead of recode because it is easier to
# use with vectors and will generate the same results
recodes[2943,]
# [1] 5 1 4 6 3 2 7
Apply this recoding to vector_1
generates:
vector_1 = (1:n)[match(vector_1, recodes[2943, ])]
table(vector_1, vector_2)
# vector_2
# vector_1 1 2 3 4 5 6 7
# 1 2 1 0 0 0 0 0
# 2 0 2 0 0 0 0 0
# 3 0 0 3 1 0 0 0
# 4 0 1 0 3 0 0 1
# 5 0 0 0 0 2 0 0
# 6 0 0 0 0 0 2 0
# 7 0 0 1 0 0 1 3
This gives same result as OP, and should strengthen the belief that the recoding provided is optimized.
Upvotes: 2
Reputation: 18490
Here is a greedy approach: The function assign_group
takes both vectors, one cluster number of vector 1 that is to be recoded and a vector of cluster numbers of vector_2 that are available (i.e. not assigned to other clusters of vector_1). The function calculates then which to which cluster number of the available clusters in v2avail
the cluster number v1cl
should be mapped. This is done by searching for the group with the most simultaneous occurences.
assign_group <- function(v1, v2, v1cl, v2avail) {
one_comparison <- function(v2cand) sum(v1==v1cl & v2==v2cand)
counts <- sapply(v2avail, FUN=one_comparison)
return(v2avail[which.max(counts)])
}
Then we can iterate through the cluster numbers of vector_1
and find the "best" cluster for each cluster number. The result, recode_map
, is a mapping from the cluster numbers of vector_1
to the cluster numbers of vector_2
.
v2avail <- unique(vector_2)
n <- length(v2avail)
recode_map <- rep(NA, n)
for (i in seq(n)) {
best <- assign_group(vector_1, vector_2, i, v2avail)
recode_map[i] <- best
v2avail <- setdiff(v2avail, best) # don't assign the same number twice
}
The recoded vector leads to similar results as in your question:
v1perm <- recode_map[vector_1]
table(v1perm, vector_2)
This approach assumes that vector_1
and vector_2
are made out of numbers 1:n
. The result is not optimal in general, and it depends on the order in which the assignment of the groups takes place. Maybe the result would be better if first the indices 1:n
are ordered by the number of occurrences in vector_1
and the for
loop is run in this order.
Upvotes: 3