Cost efficient algorithm to group array of sets

Question

Can anyone help me out with some effectively good algorithm to carry out the following task:

I got a file of unique row numbers with an array of integer numbers per row.

I need to check each single row for the values of an array that show up in different rows and put them in one group. Here is an example how it may look:

Row Number; Array of data[...]

L1; [1,2,3,4,5]
L2; [2,3]
L3: [8,9]
L4: [6]
L5; [7]
L6; [5,6]

Based on these input data, I expect the algorithm to produce the result:

Group N; Array of rows [...]

G1; [L1,L2,L4,L6]
G2; [ L3]
G3; [ L5]

P.S the original dataset accounts for hundreds of millions of rows and can contain close to a million of array elements... time efficiency is a concern.

Thanks

Peter de Rivaz · Accepted Answer

I believe this is equivalent to finding connected components of a graph in which:

The vertices correspond to the initial row numbers
There is an edge between two vertices x and y if there is a common element in the array for x and the array for y

This can be done efficiently using a disjoint set data structure as follows:

MakeSet(d) for each of the data values d (1,2,3,4,5,6,7,8,9 in your example)
For each row with array A, call join(A[0],A[i]) for each choice of i.

This will produce a set for each connected component. You can then produce your output array by iterating over the rows a second time:

set output to an array of empty lists
for each row r
    A = array for row r
    id = find(A[0])
    output[id].append(r)

Example Python Code

from collections import defaultdict
data=[[1,2,3,4,5],
      [2,3],
      [8,9],
      [6],
      [7],
      [5,6]]
N=max(max(A) for A in data)
rank=[0]*(N+1)
parent=range(N+1)

def Find(x):
    """Find representative of connected component"""
    if  parent[x] != x:
        parent[x] = Find(parent[x])
    return parent[x]

def Union(x,y):
    """Merge sets containing elements x and y"""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    if rank[x]rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1

# First join all data
for row,A in enumerate(data):
    for x in A:
        Union(A[0],x)

# Then place rows into sets
D=defaultdict(list)
for row,A in enumerate(data):
    D[Find(A[0])].append(row+1)

# Then display output
for i,L in enumerate(D.values()):
    print i+1,L

Running this code prints the output:

1 [3]
2 [1, 2, 4, 6]
3 [5]

Cost efficient algorithm to group array of sets

Answers (1)

Example Python Code

Related Questions