Reputation: 561
I have a number of ordered lists (or sequences, or vectors, or data table columns) 1, 2, 3, with several items, for example
1 2 3
A A B
G G A
F F G
C E
D C
D
How can I efficiently derive the "master" list which contains all elements in the correct order B, A, G, F, E, C, D? I don't even know what keywords to search for. Any hints are much appreciated.
Upvotes: 2
Views: 268
Reputation: 50718
How about a graph-based approach.
The idea is to translate the sequences into paths in a directed graph (so A G F C D
becomes a path A->G->F->C->D
). By simplifying the graph we can then identify the longest connected sequence in that graph, which should then correspond to your "master" sequence.
Note that I assume your sample data lst
to be a list
of vectors (see sample data at the end of this answer).
Let's construct an igraph
from the different path
s; each path is given by the entries in the lst
vectors.
library(igraph)
ig <- make_empty_graph(
n = length(unique(unlist(lst))),
directed = TRUE) %>%
set_vertex_attr("name", value = sort(unique(unlist(lst))))
for (i in 1:length(lst)) ig <- ig + path(lst[[i]])
Next we simplify
the graph
ig <- simplify(ig)
It's instructive to plot the graph
plot(ig)
We now extract all simple paths; the longest simple path corresponds to the "master" list.
pths <- sapply(V(ig), function(x) {
p <- all_simple_paths(ig, x)
names(unlist(p[which.max(lengths(p))]))
})
pths[which.max(lengths(pths))]
$B
#[1] "B" "A" "G" "F" "E" "C" "D"
The sequence matches your expected output for the master list.
v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")
lst <- list(v1, v2, v3)
Upvotes: 1
Reputation: 66870
Interesting problem. I think I have a working solution.
My thinking was that we can encode the vectors into a matrix to track which letters must come before and after each other letter by logic. Then we should be able to sort that matrix to find a working order.
Here, I take the three vectors and encode their implied ordering using nested loops.
v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")
vecs <- list(v1, v2, v3)
unique_ltrs <- unique(unlist(vecs))
ltr_len <- length(unique_ltrs)
m <- matrix(0, nrow = ltr_len, ncol = ltr_len,
dimnames = list(unique_ltrs, unique_ltrs))
# Loops to populate m with what we know
for (v in 1:length(vecs)) {
vec <- unique(unlist(vecs[v]))
for (l in 1:length(vec)) {
for (l2 in 1:length(vec)) {
m_pos <- c(match(vec[l], unique_ltrs),
match(vec[l2], unique_ltrs))
compare <- ifelse(l < l2, -1, ifelse(l2 < l, 1, 0))
m[m_pos[1], m_pos[2]] <- compare
}
}
}
Here, 1 indicates the column letter comes before the row letter, while -1 means the row comes first.
> m
A G F C D E B
A 0 -1 -1 -1 -1 -1 1
G 1 0 -1 -1 -1 -1 1
F 1 1 0 -1 -1 -1 0
C 1 1 1 0 -1 1 0
D 1 1 1 1 0 0 0
E 1 1 1 -1 0 0 0
B -1 -1 0 0 0 0 0
Then we sort the matrix (relying on the code here), and a working order appears in the rownames:
m_ord <- m[do.call(order, as.data.frame(m)),]
#> m_ord
# A G F C D E B
#B -1 -1 0 0 0 0 0
#A 0 -1 -1 -1 -1 -1 1
#G 1 0 -1 -1 -1 -1 1
#F 1 1 0 -1 -1 -1 0
#E 1 1 1 -1 0 0 0
#C 1 1 1 0 -1 1 0
#D 1 1 1 1 0 0 0
rownames(m_ord)
#[1] "B" "A" "G" "F" "E" "C" "D"
Upvotes: 0