Basj
Basj

Reputation: 46303

Populate a dict with missing values

Let b be a dict with some values :

 b = {}  
 b[90, 1] = 100, b[90, 55] = 101, b[90, 127] = 102
 b[70, 1] = 40, b[70, 45] = 41, b[70, 107] = 42

How to, in one pass, populate a dict with its missing values as nearest neighbour, for example for 0 <= i <= 127, 0 <= j <= 127? (it will give 16384 keys in the dict, so that's ok for my application).

As an example, I would like b[73, 40] = b[70, 45] = 41, i.e.nearest neighboor in a 2D plane.

Here is what I tried :

for i in range(127):
  for j in range(127):
    closest_key = min(b.keys(), key=lambda c: (c[0] - i) ** 2 + (c[1] - j) ** 2)
    b[i, j] = b[closest_key]

But it is very slow probably because there are 127*127 loops in which we loop again once over all elements to compute the distance!

How can we populate a dict with missing values, with nearest neighbour, in a more efficient way?

Upvotes: 2

Views: 226

Answers (3)

JuniorCompressor
JuniorCompressor

Reputation: 20025

You are searching inside b for the closest key. But b doesn't contain only the original keys but also the new keys you are entering at each iteration. It will be faster and more correct to just check among the initial keys:

initial_keys = set(b.keys())
for i in xrange(127):
    for j in xrange(127):
        if (i, j) not in initial_keys:
            closest_key = min(
                initial_keys, key=lambda c: (c[0] - i) ** 2 + (c[1] - j) ** 2
            )
            b[i, j] = b[closest_key]

This way the running time of the algorithm drops to O(k * n^2) from O(n^4), where n is the dimension size and k the number of initial keys.

EDIT:

You can use numpy with great speedup improvement:

import numpy as np

s = set(b.keys())
x = np.array([k[0] for k in s])
y = np.array([k[1] for k in s])
for i in xrange(128):
    for j in xrange(128):
        if (i, j) not in s:
            argmin = np.argmin((x - i) ** 2 + (y - j) ** 2)
            b[i, j] = b[x[argmin], y[argmin]]

Upvotes: 2

loopbackbee
loopbackbee

Reputation: 23332

A dictionary is absolutely inappropriate for this kind of use - unless you're happy with O(n) complexity (and then, using a list would be more clear). There is a class of hashing functions that could, conceivably, be used to implement a appropriate "dictionary" - but python's dict is definitely not up to the task.

If you need proper performance, you'll need to use another data structure. The simplest one would be a K-d tree. There's an implementation inside scipy.

You may want to review the wikipedia article devoted to nearest neighbor search


Of course, you can use a dictionary as a cache if you're querying the same values repeatedly (as in Raydel Miranda's answer). But use it as a cache - not for storing/querying your actual data!

Upvotes: 1

Raydel Miranda
Raydel Miranda

Reputation: 14360

You could try to calculate on demand, and build a cache with results. The advantage of this approach is that if you don't need to use some point it will never be calculated.

Full example:

b = {}  
b[90, 1] = 100
b[90, 55] = 101
b[90, 127] = 102
b[70, 1] = 40
b[70, 45] = 41
b[70, 107] = 42

class NeighbourDist:

  def __init__(self, source_dict):
    # Original dict.
    self.__source_dict = source_dict
    # Dict used for cache.
    self.__cache_dict = {}


  def __calculate_distance(self, x0, x1, y0, y1):
    """ Calculate distance beetwen two points. """
    dx = x1 - x0
    dy = y1 - y0
    d = (dx**2 + dy**2)**0.5
    return d

  def __getitem__(self, key):
    """
    Look for the key in the cached dict, if not has been calculated yet
    then proceed to calculate it.
    Return the result and store in __cache_dict.
    """
    cached = self.__cache_dict.get(key)
    if cached is not None:
      return cached
    else:
      x0, y0 = key
      min_n = 0
      min_  = 1e100
      for (x1, y1) in self.__source_dict.keys():
        dist = self.__calculate_distance(x0, x1, y0, y1)
        if min_ > dist:
          min_ = dist
          min_n = self.__source_dict[x1, y1]
      self.__cache_dict[key] = min_n
      return min_n

if '__main__' == __name__:
  d = NeighbourDist(b)
  print(d[73, 40]) # >>> 41
  print(d[73, 40]) # >>> 41, Second time the result is obtained from the cached dict.

Upvotes: 0

Related Questions