Amelio Vazquez-Reina

Reputation: 96302

Finding all the overlapping groups of dictionary keys

Say I have a dictionary of lists in Python. I would like to find all the groups of keys that have items in common and, for each such group, the items they share.

For example, assuming that the items are simple integers:

dct      = dict()
dct['a'] = [0, 5, 7]
dct['b'] = [1, 2, 5]
dct['c'] = [3, 2]
dct['d'] = [3]
dct['e'] = [0, 5]

The groups would be:

groups    = dict()
groups[0] = ['a', 'e']
groups[1] = ['b', 'c']
groups[2] = ['c', 'd']
groups[3] = ['a', 'b', 'e']

And the elements in common for those groups would be:

common    = dict()
common[0] = [0, 5]
common[1] = [2]
common[2] = [3]
common[3] = [5]

To solve this problem, I believe that there is value in building a matrix like the one below, but I am not sure how to proceed from this point. Are there any Python libraries that facilitate solving this type of problem?

   | a  b  c  d  e |
|a|  x  x        x |
|b|  x  x  x     x |
|c|     x  x  x    |
|d|        x  x    |
|e|  x  x        x |
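For what it's worth, the overlap half of that matrix can be computed with just the standard library; here is a minimal sketch (using the example dct above) that marks a pair of keys whenever their lists intersect:

```python
# Sketch: mark every pair of keys whose lists share at least one item,
# i.e. the off-diagonal "x" entries of the matrix above.
from itertools import combinations

dct = {'a': [0, 5, 7], 'b': [1, 2, 5], 'c': [3, 2],
       'd': [3], 'e': [0, 5]}

overlap = {}  # (key1, key2) -> do the two lists intersect?
for k1, k2 in combinations(sorted(dct), 2):
    overlap[(k1, k2)] = bool(set(dct[k1]) & set(dct[k2]))

print(overlap[('a', 'e')])  # True: 'a' and 'e' share 0 and 5
print(overlap[('a', 'c')])  # False: no items in common
```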

Update

I tried to wrap the solution that @NickBurns provided in a function, but I am having trouble reproducing his results:

dct = { 'a' : [0, 5, 7], 'b' : [1, 2, 5], 'c' : [3, 2], 'd' : [3], 'e' : [0, 5]}

groups, common_items = get_groups(dct)
print 'Groups', groups
print 'Common items',  common_items

I get:

Groups: defaultdict(<type 'list'>, {0: ['a', 'e'], 2: ['c', 'b'], 3: ['c', 'd'], 5: ['a', 'b', 'e']})                                                        

Common items: {0: None, 2: None, 3: None, 5: None}

And here are the functions:

from collections import defaultdict
def common(query_group, dct):
    """ Recursively find the common elements within groups """
    if len(query_group) <= 1:
        return
    # Extract the elements from groups,
    # Pull their original values from dct
    # Get the intersection of these
    first, second = set(dct[query_group[0]]), set(dct[query_group[1]])  
    # print(first.intersection(second))
    return common(query_group[2:], dct)


def get_groups(dct):
    groups = defaultdict(list)

    for key, values in dct.items():
        for value in values:
            groups[value].append(key)

    # Clean up the groups:
    for key in groups.keys():
        # i.e. the value is common to more than 1 group
        if len(groups[key]) <= 1:
            del groups[key]

    # Identify common elements:
    common_items = dict()
    for k, v in groups.iteritems():
        if len(v) > 1:
            common_items[k] = common(v, dct)

    return groups, common_items
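(For reference, the None values trace back to the stripped-down common above: its base case returns nothing, and each pairwise intersection is computed but never kept. A minimal sketch that threads an accumulator through the recursion, which is my guess at the missing piece, returns the intersections instead:)

```python
# Sketch: same recursive shape, but the base case returns an accumulator
# instead of None, and each pairwise intersection is kept.
def common(query_group, dct, have_common=None):
    if have_common is None:  # avoid a mutable default argument
        have_common = []
    if len(query_group) <= 1:
        return have_common
    first, second = set(dct[query_group[0]]), set(dct[query_group[1]])
    have_common.extend(sorted(first & second))
    return common(query_group[2:], dct, have_common)

dct = {'a': [0, 5, 7], 'b': [1, 2, 5], 'c': [3, 2], 'd': [3], 'e': [0, 5]}
print(common(['a', 'e'], dct))       # [0, 5]
print(common(['a', 'b', 'e'], dct))  # [5]
```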

Upvotes: 2

Views: 1701

Answers (4)

Nick Burns

Reputation: 983

I would try to create a second dictionary (groups) that represents the intersection of each list in the original dct. For example, you could do this with a defaultdict, something like:

from collections import defaultdict
groups = defaultdict(list)
dct = { 'a' : [0, 5, 7], 'b' : [1, 2, 5], 'c' : [3, 2], 'd' : [3], 'e' : [0, 5]}
for key, values in dct.items():
    for value in values:
        groups[value].append(key)

for key in groups.keys():
    if len(groups[key]) > 1:    # i.e. the value is common to more than 1 group
        print(key, groups[key])

(0, ['a', 'e'])
(2, ['c', 'b'])
(3, ['c', 'd'])
(5, ['a', 'b', 'e'])

Finding the common elements is a little messier: you need to run through each group and find the intersection from the original dct. Perhaps a recursive routine like this would work:

def common(query_group, dct, have_common=[]):
    """ Recursively find the common elements within groups """

    if len(query_group) <= 1:
        return have_common

    # extract the elements from groups, and pull their original values from dct
    # then get the intersection of these
    first, second = set(dct[query_group[0]]), set(dct[query_group[1]])
    have_common.extend(first.intersection(second))

    return common(query_group[2:], dct, have_common)

for query_group in groups.values():
    if len(query_group) > 1:
        print(query_group, '=>', common(query_group, dct, have_common=[]))

['e', 'a'] => [0, 5]    
['b', 'c'] => [2]    
['d', 'c'] => [3]    
['e', 'b', 'a'] => [5]

Clearly it needs some prettier formatting, but I think it gets the job done. Hopefully that helps.
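One caveat with the pairwise recursion, worth noting: for a group like ['e', 'b', 'a'] it intersects 'e' with 'b' and then hits the base case before 'a' is ever consulted (on this data it happens to give the right answer anyway). An alternative sketch that intersects the sets of every key in the group in a single call sidesteps that:

```python
# Sketch: intersect the sets of ALL keys in the group at once, so no key
# is skipped when the group has an odd number of members.
def common_all(query_group, dct):
    return sorted(set.intersection(*(set(dct[k]) for k in query_group)))

dct = {'a': [0, 5, 7], 'b': [1, 2, 5], 'c': [3, 2], 'd': [3], 'e': [0, 5]}
print(common_all(['e', 'b', 'a'], dct))  # [5]
print(common_all(['a', 'e'], dct))       # [0, 5]
```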

Upvotes: 3

Jaime

Reputation: 67427

This is a big mess, but it works. It is basically building an array like this:

  | 0 1 2 3 4 5 6 7 |
  +-----------------+
|a| 1 0 0 0 0 1 0 1 |
|b| 0 1 1 0 0 1 0 0 |
|c| 0 0 1 1 0 0 0 0 |
|d| 0 0 0 1 0 0 0 0 |
|e| 1 0 0 0 0 1 0 0 |

The groups are the unique columns with more than one 1. To find all common elements to a group, you find the columns that have 1s in the places where the group definition has 1s. And writing it in Python, with a hacky use of scipy's sparse matrices to build the above array, I got the following:

import numpy as np
import scipy.sparse as sps

dct = {'a' : [0, 5, 7], 'b' : [1, 2, 5], 'c' : [3, 2],
       'd' : [3], 'e' : [0, 5]}

keys = []
lens = []
vals = []

for key, items in dct.items():
    keys.append(key)
    lens.append(len(items))
    vals.extend(items)

keys = np.array(keys)
lens = np.array(lens)
vals = np.array(vals)
unique_values, val_idx = np.unique(vals, return_inverse=True)

data = np.ones_like(val_idx)
indices = val_idx
indptr = np.concatenate(([0], np.cumsum(lens)))

dct_array = sps.csr_matrix((data, indices, indptr))
dct_array = dct_array.T.toarray()
mask = dct_array.sum(axis=-1) >= 2
dct_array = dct_array[mask].astype(np.bool)
unique_values = unique_values[mask]

dct_array = np.ascontiguousarray(dct_array)
dct_array = dct_array.view((np.void,
                            (dct_array.dtype.itemsize *
                             len(keys)))).ravel()
groups, grp_idx = np.unique(dct_array,
                            return_index=True)
groups = groups.view(np.bool).reshape(-1, len(keys))
dct_array = dct_array.view(np.bool).reshape(-1, len(keys))

for group, idx in zip(groups, grp_idx) :
    print 'group {0}'.format(keys[group])
    common = unique_values[np.all(np.logical_and(dct_array[idx],
                                                 dct_array) ==
                                  dct_array[idx], axis=-1)]
    print 'common {0}'.format(common)

This prints out:

group ['c' 'd']
common [3]
group ['c' 'b']
common [2]
group ['a' 'e']
common [0 5]
group ['a' 'b' 'e']
common [5]

Upvotes: 1

Brionius

Reputation: 14098

This is pretty close to what you're asking for - take a look at it, and see if it's close enough.

from collections import defaultdict

dct = dict()
dct['a'] = [0, 5, 7]
dct['b'] = [1, 2, 5]
dct['c'] = [3, 2]
dct['d'] = [3]
dct['e'] = [0, 5]

inverseDict = defaultdict(list)
for key in dct:
    for item in dct[key]:
        inverseDict[item].append(key)
for item in inverseDict.keys():
    if len(inverseDict[item]) < 2:
        del inverseDict[item]

for item in inverseDict:
    print item, ":", inverseDict[item]

Output:

0 : ['a', 'e']
2 : ['c', 'b']
3 : ['c', 'd']
5 : ['a', 'b', 'e']
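To recover the second half of the question (the common items per group), one possible extension of this inverse dictionary, my own sketch rather than part of the answer above, is to treat each distinct key-set as a group and collect every item whose key-set contains it; on the example data this reproduces the expected values, e.g. [0, 5] for the group ['a', 'e']:

```python
# Sketch: derive the groups and their common items from an inverse
# item -> keys mapping.
from collections import defaultdict

dct = {'a': [0, 5, 7], 'b': [1, 2, 5], 'c': [3, 2], 'd': [3], 'e': [0, 5]}

inverseDict = defaultdict(list)
for key in dct:
    for item in dct[key]:
        inverseDict[item].append(key)

# Each distinct key-set shared by an item is a group; an item is common
# to a group when the group's keys are a subset of the item's keys.
group_sets = {frozenset(keys) for keys in inverseDict.values() if len(keys) > 1}
common = {}
for group in group_sets:
    common[tuple(sorted(group))] = sorted(
        item for item, keys in inverseDict.items() if group <= set(keys))

print(common[('a', 'e')])       # [0, 5]
print(common[('a', 'b', 'e')])  # [5]
```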

Upvotes: 2

Ashwini Chaudhary

Reputation: 250971

You can use the NetworkX library to get that matrix (adjacency matrix) representation:

import networkx as nx
dct = { 'a' : [0, 5, 7], 'b' : [1, 2, 5], 'c' : [3, 2], 'd' : [3], 'e' : [0, 5]}
nodes = sorted(dct)

G = nx.Graph()
for node in nodes:
    attached_nodes = dct[node]
    G.add_node(node)
    for nod in attached_nodes:
        if 0 <= nod < len(nodes):
            G.add_edge(node, nodes[nod])

print G.nodes()
print G.edges()
print G.has_edge('a','b')
print G.has_edge('b','c')

Output:

['a', 'c', 'b', 'e', 'd']
[('a', 'a'), ('a', 'e'), ('c', 'c'), ('c', 'b'), ('c', 'd'), ('b', 'b'), ('d', 'd')]
False
True

Upvotes: 2
