Reputation: 33
I am working on a problem where I have to group related items and assign a unique id to them. I have written the code in python but it is not returning the expected output. I need assistance in refining my logic. The code is below:
data = {}
child_list = []
for index, row in df.iterrows():
parent = row['source']
child = row['target']
#print 'Parent: ', parent
#print 'Child:', child
child_list.append(child)
#print child_list
if parent not in data.keys():
data[parent] = []
if parent != child:
data[parent].append(child)
#print data
op = {}
gid = 0
def recursive(op,x,gid):
if x in data.keys() and data[x] != []:
for x_child in data[x]:
if x_child in data.keys():
op[x_child] = gid
recursive(op,x_child,gid)
else:
op[x] = gid
else:
op[x] = gid
for key in data.keys():
#print "Key: ", key
if key not in child_list:
gid = gid + 1
op[key] = gid
for x in data[key]:
op[x] = gid
recursive(op,x,gid)
related = pd.DataFrame({'items':op.keys(),
'uniq_group_id': op.values()})
mapped.sort_values('items')
Example below
Input:
source target
a b
b c
c c
c d
d d
e f
a d
h a
i f
Desired Output:
item uniq_group_id
a 1
b 1
c 1
d 1
h 1
e 2
f 2
i 2
My code gave me below output which is wrong.
item uniq_group_id
a 3
b 3
c 3
d 3
e 1
f 2
h 3
i 2
Another Example
Input:
df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
'target':['b','c','c','d','d','f','d','a','f','a']})
Desired Output:
item uniq_group_id
a 1
b 1
c 1
d 1
e 2
f 2
My code Output:
item uniq_group_id
e 1
f 1
The order of the rows or the group id does not matter. The important thing here is to assign related items a same unique identifier. The whole problem is to find related group of items and assign them a unique group id.
Upvotes: 2
Views: 997
Reputation: 82929
You can use the Union-Find, or Disjoint-Sets algorithm for this. See this related answer for a more complete explanation. Basically, you need two functions, union
and find
, to create a tree (i.e. a nested dictionary) of leaders
or predecessors:
leaders = collections.defaultdict(lambda: None)
def find(x):
l = leaders[x]
if l is not None:
l = find(l)
leaders[x] = l
return l
return x
def union(x, y):
lx, ly = find(x), find(y)
if lx != ly:
leaders[lx] = ly
You can apply this to your problem as follows:
df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
'target': ['b','c','c','d','d','f','d','a','f','a']})
# build the tree
for _, row in df.iterrows():
union(row["source"], row["target"])
# build groups based on leaders
groups = collections.defaultdict(set)
for x in leaders:
groups[find(x)].add(x)
for num, group in enumerate(groups.values(), start=1):
print(num, group)
Result:
1 {'e', 'f', 'i'}
2 {'h', 'a', 'c', 'd', 'b'}
Upvotes: 1
Reputation: 55489
I haven't analyzed your code closely, but it looks like the error is because of the way you populate the data
dictionary. It stores a child node as being a neighbor of its parent node, but it also needs to store the parent as being a neighbor of the child.
Rather than attempting to fix your code I decided to adapt this pseudocode written by Aseem Goyal. The code below takes its input data from simple Python lists, but it should be easy to adapt it to work with a Pandas dataframe.
''' Find all the connected components of an undirected graph '''
from collections import defaultdict
src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']
nodes = sorted(set(src + tgt))
print('Nodes', nodes)
neighbors = defaultdict(set)
for u, v in zip(src, tgt):
neighbors[u].add(v)
neighbors[v].add(u)
print('Neighbors')
for n in nodes:
print(n, neighbors[n])
visited = {}
def depth_first_traverse(node, group_id):
for n in neighbors[node]:
if n not in visited:
visited[n] = group_id
depth_first_traverse(n, group_id)
print('Groups')
group_id = 1
for n in nodes:
if n not in visited:
visited[n] = group_id
depth_first_traverse(n, group_id)
group_id += 1
print(n, visited[n])
output
Nodes ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i']
Neighbors
a {'a', 'd', 'b', 'h'}
b {'a', 'c'}
c {'d', 'b', 'c'}
d {'d', 'a', 'c'}
e {'f'}
f {'i', 'e'}
h {'a'}
i {'f'}
Groups
a 1
b 1
c 1
d 1
e 2
f 2
h 1
i 2
This code was written for Python 3, but will also run on Python 2. If you do run it on Python 2 you should add from __future__ import print_function
at the top of your import statements; it's not strictly necessary, but it will make the output look nicer.
Upvotes: 1