Sam
Sam

Reputation: 33

Create unique ids for a group

I am working on a problem where I have to group related items and assign a unique id to them. I have written the code in python but it is not returning the expected output. I need assistance in refining my logic. The code is below:

data = {}
child_list = []


for index, row in df.iterrows():
    parent = row['source']
    child = row['target']
    #print 'Parent: ', parent
    #print 'Child:', child
    child_list.append(child)
    #print child_list
    if parent not in data.keys():
        data[parent] = []
    if parent != child:
        data[parent].append(child)
    #print data

op = {}
gid = 0


def recursive(op,x,gid):
    if x in data.keys() and data[x] != []:
        for x_child in data[x]:
            if x_child in data.keys():
                op[x_child] = gid
                recursive(op,x_child,gid)
            else:
                op[x] = gid
    else:
        op[x] = gid


for key in data.keys():
    #print "Key: ", key
    if key not in child_list:
        gid = gid + 1
        op[key] = gid
        for x in data[key]:
            op[x] = gid
            recursive(op,x,gid)

related = pd.DataFrame({'items':op.keys(),
                  'uniq_group_id': op.values()})
mapped.sort_values('items')

Example below

Input:
source  target
a        b
b        c
c        c
c        d
d        d
e        f
a        d
h        a
i        f  

Desired Output:
item     uniq_group_id
a         1 
b         1
c         1
d         1
h         1
e         2
f         2
i         2

My code gave me below output which is wrong.

item    uniq_group_id
a       3
b       3
c       3
d       3
e       1
f       2
h       3
i       2 

Another Example

Input:
df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                'target':['b','c','c','d','d','f','d','a','f','a']})
Desired Output:
item    uniq_group_id
a       1
b       1
c       1
d       1
e       2
f       2

My code Output:
item    uniq_group_id
e   1
f   1

The order of the rows or the group id does not matter. The important thing here is to assign related items a same unique identifier. The whole problem is to find related group of items and assign them a unique group id.

Upvotes: 2

Views: 997

Answers (2)

tobias_k
tobias_k

Reputation: 82929

You can use the Union-Find, or Disjoint-Sets algorithm for this. See this related answer for a more complete explanation. Basically, you need two functions, union and find, to create a tree (i.e. a nested dictionary) of leaders or predecessors:

leaders = collections.defaultdict(lambda: None)

def find(x):
    l = leaders[x]
    if l is not None:
        l = find(l)
        leaders[x] = l
        return l
    return x

def union(x, y):
    lx, ly = find(x), find(y)
    if lx != ly:
        leaders[lx] = ly

You can apply this to your problem as follows:

df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                   'target': ['b','c','c','d','d','f','d','a','f','a']})

# build the tree
for _, row in df.iterrows():
    union(row["source"], row["target"])

# build groups based on leaders
groups = collections.defaultdict(set)
for x in leaders:
    groups[find(x)].add(x)
for num, group in enumerate(groups.values(), start=1):
    print(num, group)

Result:

1 {'e', 'f', 'i'}
2 {'h', 'a', 'c', 'd', 'b'}

Upvotes: 1

PM 2Ring
PM 2Ring

Reputation: 55489

I haven't analyzed your code closely, but it looks like the error is because of the way you populate the data dictionary. It stores a child node as being a neighbor of its parent node, but it also needs to store the parent as being a neighbor of the child.

Rather than attempting to fix your code I decided to adapt this pseudocode written by Aseem Goyal. The code below takes its input data from simple Python lists, but it should be easy to adapt it to work with a Pandas dataframe.

''' Find all the connected components of an undirected graph '''

from collections import defaultdict

src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']

nodes = sorted(set(src + tgt))
print('Nodes', nodes)

neighbors = defaultdict(set)
for u, v in zip(src, tgt):
    neighbors[u].add(v)
    neighbors[v].add(u)

print('Neighbors')
for n in nodes:
    print(n, neighbors[n])

visited = {}
def depth_first_traverse(node, group_id):
    for n in neighbors[node]:
        if n not in visited:
            visited[n] = group_id
            depth_first_traverse(n, group_id)

print('Groups')
group_id = 1
for n in nodes:
    if n not in visited:
        visited[n] = group_id
        depth_first_traverse(n, group_id)
        group_id += 1
    print(n, visited[n])

output

Nodes ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i']
Neighbors
a {'a', 'd', 'b', 'h'}
b {'a', 'c'}
c {'d', 'b', 'c'}
d {'d', 'a', 'c'}
e {'f'}
f {'i', 'e'}
h {'a'}
i {'f'}
Groups
a 1
b 1
c 1
d 1
e 2
f 2
h 1
i 2

This code was written for Python 3, but will also run on Python 2. If you do run it on Python 2 you should add from __future__ import print_function at the top of your import statements; it's not strictly necessary, but it will make the output look nicer.

Upvotes: 1

Related Questions