Reputation: 33
Suppose there are a few firm combination result in certain best value, how to efficiently select unique best combination that every firm appear only once the data.table way?
The sample data:
require(data.table)
set.seed(1234)
allcombs <- data.table(val=sample(1:20,15), t(combn(LETTERS[1:6], 2)))
setnames(allcombs, paste0("V",1:2), paste0("firm",1:2))
copy_sets = copy(allcombs)
allcombs
val firm1 firm2
1: 16 A B
2: 5 A C
3: 12 A D
4: 15 A E
5: 9 A F
6: 19 B C
7: 6 B D
8: 4 B E
9: 2 B F
10: 7 C D
11: 14 C E
12: 10 C F
13: 11 D E
14: 20 D F
15: 13 E F
I can do this with a loop:
all_elements = unique(c(allcombs$firm1, allcombs$firm2))
selected_pairs = data.table()
while (length(all_elements) > 0){
selected_pairs <- rbind(selected_pairs, allcombs[allcombs[firm1 == all_elements[1] | firm2 == all_elements[1], .I[which.max(val)]]])
all_elements <- setdiff(all_elements, unlist(allcombs[allcombs[firm1 == all_elements[1] | firm2 == all_elements[1], .I[which.max(val)]],.(firm1,firm2)]))
allcombs <- allcombs[firm1 %in% all_elements & firm2 %in% all_elements]
}
Here is what I wanted:
selected_pairs
val firm1 firm2
1: 16 A B
2: 14 C E
3: 20 D F
Any help appreciated!
Upvotes: 0
Views: 92
Reputation: 42564
If I understand correctly, the OP wants to select the unique best combinations where every firm appears only once.
The code below picks the combination with the highest val
, marks all other rows which contain firm1
or firm2
as done and iteratively continues with the remaining rows until all rows are done. Bookeeping is done by updating the rank
column by reference, i.e., without copying.
d <- copy(allcombs)
setorder(d, -val)
d[, rank := NA_integer_]
r = 0L
remain <- d[, .I]
while (length(remain) > 0) {
r <- r + 1L
idx <- remain[d[remain, which.max(val)]]
d[idx, rank := r]
lut <- d[idx, .(firm = c(firm1, firm2), rank = NA_integer_)]
d[lut, on = c("firm1==firm", "rank"), rank := 0]
d[lut, on = c("firm2==firm", "rank"), rank := 0]
remain <- d[, .I[is.na(rank)]]
}
d[rank > 0]
val firm1 firm2 rank <int> <char> <char> <int> 1: 20 D F 1 2: 19 B C 2 3: 15 A E 3
Note that the result here differs from OP's result as the data.table is ordered by decreasing val
while OP's code iterates over the rows in order in which the company names appear in firm1
and firm2
.
I find this arbitrary and not conclusive. OP's approach will select only suboptima within the combinations of the current firm1
instance but not the overall optimum of all the remaining rows.
Here is a simplified version of above code which uses an additional row id rn
column instead of the remain
vector:
d <- copy(allcombs)
d[, rank := NA_integer_] # append bookkeeping column
d[, rn := .I] # append row id
r = 0L
while (any(is.na(d$rank))) {
r <- r + 1L
idx <- d[is.na(rank), rn[which.max(val)]]
d[idx, rank := r]
lut <- d[idx, .(firm = c(firm1, firm2), rank = NA_integer_)]
d[lut, on = c("firm1==firm", "rank"), rank := 0L]
d[lut, on = c("firm2==firm", "rank"), rank := 0L]
}
d[rank > 0]
Out of curiosity I have tried to reproduce OP's expected result. So, here is a variant of the code above which loops over the unique company names:
d <- copy(allcombs)
firms <- d[, unique(c(firm1, firm2))]
# firms <- rev(d[, unique(c(firm1, firm2))])
d[, rank := NA_integer_]
d[, rn := .I] # append row id
r = 0L
for (f in firms) {
r <- r + 1L
idx <- d[is.na(rank) & (firm1 == f | firm2 == f), rn[which.max(val)]]
d[idx, rank := r]
lut <- d[idx, .(firm = c(firm1, firm2), rank = NA_integer_)]
d[lut, on = c("firm1==firm", "rank"), rank := 0L]
d[lut, on = c("firm2==firm", "rank"), rank := 0L]
if (!any(is.na(d$rank))) break
}
d[rank > 0]
val firm1 firm2 rank rn
1: 16 A B 1 1
2: 14 C E 3 11
3: 20 D F 4 14
As mentioned above, the result may depend on the order of appearance of company names in firm1
and firm2
.
This can be demonstrated by reversing the order of company names by
firms <- rev(d[, unique(c(firm1, firm2))])
Now, the code returns
val firm1 firm2 rank rn
1: 15 A E 2 4
2: 19 B C 4 6
3: 20 D F 1 14
The bookkeeping columns have not been removed for demonstration.
Upvotes: 1