BenP
BenP

Reputation: 845

Combine python dictionaries that share values and keys

I am doing some entity matching based on string edit distance and my results are a dictionary with keys (query string) and values [list of similar strings] based on some scoring criteria.

for example:

results = {
  'ben' : ['benj', 'benjamin', 'benyamin'],
  'benj': ['ben', 'beny', 'benjamin'],
  'benjamin': ['benyamin'],
  'benyamin': ['benjamin'],
  'carl': ['karl'],
  'karl': ['carl'],
}

Each value also has a corresponding dictionary item, for which it is the key (e.g. 'carl' and 'karl').

I need to combine the elements that have shared values. Choosing one value as the new key (lets say the longest string). In the above example I would hope to get:

results = {
  'benjamin': ['ben', 'benj', 'benyamin', 'beny', 'benjamin', 'benyamin'],
  'carl': ['carl','karl']
}

I have tried iterating through the dictionary using the keys, but I can't wrap my head around how to iterate and compare through each dictionary item and its list of values (or single value).

Upvotes: 3

Views: 211

Answers (2)

jpp
jpp

Reputation: 164623

This is one solution using collections.defaultdict and sets.

The desired output is very similar to what you have, and can be easily manipulated to align.

from collections import defaultdict

results = {
  'ben' : ['benj', 'benjamin', 'benyamin'],
  'benj': ['ben', 'beny', 'benjamin'],
  'benjamin': 'benyamin',
  'benyamin': 'benjamin',
  'carl': 'karl',
  'karl': 'carl',
}

d = defaultdict(set)

for i, (k, v) in enumerate(results.items()):
    w = {k} | (set(v) if isinstance(v, list) else {v})
    for m, n in d.items():
        if not n.isdisjoint(w):
            d[m].update(w)
            break
    else:
        d[i] = w

result = {max(v, key=len): v for k, v in d.items()}

# {'benjamin': {'ben', 'benj', 'benjamin', 'beny', 'benyamin'},
#  'carl': {'carl', 'karl'}}

Credit to @IMCoins for the idea of manipulating v to w in second loop.

Explanation

There are 3 main steps:

  1. Convert values into a consistent set format, including keys and values from original dictionary.
  2. Cycle through this dictionary and add values to a new dictionary. If there is an intersection with some key [i.e. sets are not disjoint], then use that key. Otherwise, add to new key determined via enumeration.
  3. Create result dictionary in a final transformation by mapping max length key to values.

Upvotes: 3

IMCoins
IMCoins

Reputation: 3306

EDIT : Even though performance was not the question here, I took the liberty to perform some tests between jpp's answer, and mine... here is the full script. My script performs the tests in 17.79 seconds, and his in 23.5 seconds.

import timeit

results = {
  'ben' : ['benj', 'benjamin', 'benyamin'],
  'benj': ['ben', 'beny', 'benjamin'],
  'benjamin': ['benyamin'],
  'benyamin': ['benjamin'],
  'carl': ['karl'],
  'karl': ['carl'],
}

def imcoins(result):
    new_dict = {}
    # .items() for python3x
    for k, v in results.iteritems():
        flag = False
        #   Checking if key exists...
        if k not in new_dict.keys():
            #   But then, we also need to check its values.
            for item in v:
                if item in new_dict.keys():
                    #   If we update, set the flag to True, so we don't create a new value.
                    new_dict[item].update(v)
                    flag = True
            if flag == False:
                new_dict[k] = set(v)

    #   Now, to sort our newly created dict...
    sorted_dict = {}
    for k, v in new_dict.iteritems():
        max_string = max(v)
        if len(max_string) > len(k):
            sorted_dict[max(v, key=len)] = set(v)
        else:
            sorted_dict[k] =  v

    return sorted_dict

def jpp(result):
    from collections import defaultdict

    res = {i: {k} | (set(v) if isinstance(v, list) else {v}) \
          for i, (k, v) in enumerate(results.items())}

    d = defaultdict(set)

    for i, (k, v) in enumerate(res.items()):
        for m, n in d.items():
            if n & v:
                d[m].update(v)
                break
        else:
            d[i] = v

    result = {max(v, key=len): v for k, v in d.items()}
    return result

iterations = 1000000
time1 = timeit.timeit(stmt='imcoins(results)', setup='from __main__ import imcoins, results', number=iterations)
time2 = timeit.timeit(stmt='jpp(results)', setup='from __main__ import jpp, results', number=iterations)

print time1 # Outputs : 17.7903265883
print time2 # Outputs : 23.5605850732

If I move the import from his function to global scope, it gives...

imcoins : 13.4129249463 seconds

jpp : 21.8191823393 seconds

Upvotes: 2

Related Questions