Reputation: 23
Can anyone help me out with some effectively good algorithm to carry out the following task:
I got a file of unique row numbers with an array of integer numbers per row.
I need to check each single row for the values of an array that show up in different rows and put them in one group. Here is an example how it may look:
Row Number; Array of data[...]
L1; [1,2,3,4,5]
L2; [2,3]
L3: [8,9]
L4: [6]
L5; [7]
L6; [5,6]
Based on these input data, I expect the algorithm to produce the result:
Group N; Array of rows [...]
G1; [L1,L2,L4,L6]
G2; [ L3]
G3; [ L5]
P.S the original dataset accounts for hundreds of millions of rows and can contain close to a million of array elements... time efficiency is a concern.
Thanks
Upvotes: 2
Views: 804
Reputation: 33509
I believe this is equivalent to finding connected components of a graph in which:
This can be done efficiently using a disjoint set data structure as follows:
MakeSet(d)
for each of the data values d (1,2,3,4,5,6,7,8,9 in your example)join(A[0],A[i])
for each choice of i.This will produce a set for each connected component. You can then produce your output array by iterating over the rows a second time:
set output to an array of empty lists
for each row r
A = array for row r
id = find(A[0])
output[id].append(r)
from collections import defaultdict
data=[[1,2,3,4,5],
[2,3],
[8,9],
[6],
[7],
[5,6]]
N=max(max(A) for A in data)
rank=[0]*(N+1)
parent=range(N+1)
def Find(x):
"""Find representative of connected component"""
if parent[x] != x:
parent[x] = Find(parent[x])
return parent[x]
def Union(x,y):
"""Merge sets containing elements x and y"""
x = Find(x)
y = Find(y)
if x == y:
return
if rank[x]<rank[y]:
parent[x] = y
elif rank[x]>rank[y]:
parent[y] = x
else:
parent[y] = x
rank[x] += 1
# First join all data
for row,A in enumerate(data):
for x in A:
Union(A[0],x)
# Then place rows into sets
D=defaultdict(list)
for row,A in enumerate(data):
D[Find(A[0])].append(row+1)
# Then display output
for i,L in enumerate(D.values()):
print i+1,L
Running this code prints the output:
1 [3]
2 [1, 2, 4, 6]
3 [5]
Upvotes: 2