Reputation: 2411
I have a collection of lists that each represent a path on a graph:
list1 = [1,2,3,6] # directed edge from 1->2, 2->3, 3->6
list2 = [8,3,5,6]
list3 = [9,1,3,4]
list4 = [7,8,1,4]
I also have the adjacency matrix of the graph.
At each time step I am given an edge:
e.g. time step 0: [1,2], time step 1: [3,6]
At each time step I have to find the most similar list, taking previous time steps into account; that is, the list that is the most complete.
What is an efficient way to do this?
I tried a naive approach of comparing each incoming edge to every edge in every list, but since I have a large number of lists, each with a large number of edges, this is too slow.
Update: here is an example input and output at each time step.
time step 0: input [1,2], output: list1
time step 1: input [8,3], output: list1, list2 # equally complete
time step 2: input [3,6], output: list1
Update 2: thanks to @Nuclearman, I coded (maybe naively?) the solution:
list1 = [1,2,3,6] # directed edges 1->2, 2->3, 3->6
list2 = [8,3,5,6]
list3 = [9,1,3,4]
list4 = [7,8,1,4]

lists_dict = {
    'list1': list1,
    'list2': list2,
    'list3': list3,
    'list4': list4
}

# total edges per list (one fewer than the number of nodes)
edges = {name: len(lst) - 1 for name, lst in lists_dict.items()}

# edges seen so far per list
covered_edges = {name: 0 for name in lists_dict}

# graph[node1][node2] -> name of the list containing edge node1->node2
# (assumes each edge belongs to exactly one list)
graph = {}
for list_name, lst in lists_dict.items():
    for node1, node2 in zip(lst, lst[1:]):
        graph.setdefault(node1, {})[node2] = list_name

times = [[1,2],[3,5],[5,6],[8,1],[7,8]]
for node1, node2 in times:
    list_name = graph[node1][node2] # list containing the incoming edge
    covered_edges[list_name] += 1
    print(covered_edges)
    completeness = {name: covered_edges[name] / edges[name]
                    for name in lists_dict}
    mx = max(completeness.values())
    max_list = [k for k, v in completeness.items() if v == mx]
    print(max_list)
    print('')
Upvotes: 2
Views: 237
Reputation: 5304
Try using an adjacency list set up as a nested hash to represent the graph.
I.e., your entire example could be set up this way (I don't recall if this is valid Python):
graph = {
    1: {2: [1], 3: [3], 4: [4]},
    2: {3: [1]},
    3: {6: [1], 5: [2], 4: [3]},
    5: {6: [2]},
    7: {8: [4]},
    8: {3: [2], 1: [4]},
    9: {1: [3]},
}
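As a minimal sketch (assuming the four lists from the question), this nested-hash structure can be built automatically rather than written by hand. Mapping each edge to a list of list numbers also handles the case where an edge appears in more than one list:

```python
# Build the nested-hash adjacency structure from the question's lists.
# graph[u][v] is the list of list-numbers containing the edge u->v.
lists = {1: [1,2,3,6], 2: [8,3,5,6], 3: [9,1,3,4], 4: [7,8,1,4]}

graph = {}
for list_no, path in lists.items():
    for u, v in zip(path, path[1:]):  # consecutive nodes form an edge
        graph.setdefault(u, {}).setdefault(v, []).append(list_no)

print(graph[1][2])  # [1]
print(graph[8][3])  # [2]
```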
Then you just keep a tally of how many edges remain in each list, stored in a data structure with O(log N) or better find-min (or find-max; just adjust the key), lookup, insert, and remove. You may need to do some math depending on how you define completeness. I.e., you may need to store both the total edges and the covered edges, and use [(total - covered) / total, list #] as the key for the data structure. That way you can quickly find the list even when multiple lists have the same completeness. For the result you want, return all lists with the highest completeness.
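One hypothetical way to get that find-max behavior in Python is a binary heap with lazy removal of stale entries (heapq is a min-heap, so completeness is negated); the names and totals here are illustrative, not part of the question:

```python
import heapq

totals  = {1: 3, 2: 3, 3: 3, 4: 3}   # total edges per list
covered = {k: 0 for k in totals}     # covered edges per list
heap = []                            # entries: (-completeness, list_no)

def bump(list_no):
    # record one more covered edge and push the new completeness
    covered[list_no] += 1
    heapq.heappush(heap, (-covered[list_no] / totals[list_no], list_no))

def current_best():
    # discard stale entries whose completeness has since changed
    while heap and -heap[0][0] != covered[heap[0][1]] / totals[heap[0][1]]:
        heapq.heappop(heap)
    return heap[0][1] if heap else None

bump(1)                 # edge [1,2] belongs to list 1
print(current_best())   # 1
```

Each bump and lookup is O(log L) amortized, matching the bound described above.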
The above graph lets you quickly determine which list(s) each edge belongs to; you then look up each such list in the remaining counts and decrease its count by one. I.e., you can see that graph[1][2]
is list 1, graph[8][3]
is list 2, and graph[3][6]
is list 1 as well.
For performance, you may also want to keep a set of already-seen edges and skip them, though whether that is needed (and how you want to handle repeats) depends on your use case.
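Putting the pieces together, a sketch of the per-time-step update might look like this, assuming the nested-hash graph from above (with edges mapped to lists of list-numbers) and a duplicate-edge filter:

```python
# graph[u][v] -> list-numbers containing the edge u->v (from the answer above)
graph = {1: {2: [1], 3: [3], 4: [4]}, 2: {3: [1]},
         3: {6: [1], 5: [2], 4: [3]}, 5: {6: [2]},
         7: {8: [4]}, 8: {3: [2], 1: [4]}, 9: {1: [3]}}
remaining = {1: 3, 2: 3, 3: 3, 4: 3}  # uncovered edges per list
seen = set()                          # skip duplicate incoming edges

for u, v in [[1, 2], [8, 3], [3, 6], [1, 2]]:  # last edge is a repeat
    if (u, v) in seen:
        continue
    seen.add((u, v))
    for list_no in graph.get(u, {}).get(v, []):
        remaining[list_no] -= 1

print(remaining)  # {1: 1, 2: 2, 3: 3, 4: 3}
```

The most complete list is the one with the fewest remaining edges (list 1 here), which is what the find-max structure retrieves in O(log L).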
Performance is proportional to the number of lists you need to update, making it output-sensitive. If the example provided is anything to go on, the number of list counts you need to update for each incoming edge is likely very small compared to the number of lists. Say there are E total edges across the L lists, you need to process K edges online, and those K edges touch a total of A lists (A is an output-sensitive quantity that depends on how much overlap there is between the lists; your example has zero overlap, since each edge belongs to a single list, but it is unclear whether that would remain true with more lists and edges). Then the performance is O(E + K + A log L), I believe: E is the preprocessing required to build the graph, and each of the A touched lists requires a log L lookup to find and update its count. This seems basically optimal unless there is something else, and is likely much better than the O(K*E) you currently have, unless overlap (A) is extremely high.
Upvotes: 1