schoon
schoon

Reputation: 3334

Algorithm for matching objects

I have 1,000 objects, each object has 4 attribute lists: a list of words, images, audio files and video files.

I want to compare each object against:

  1. a single object, Ox, from the 1,000.
  2. every other object.

A comparison will be something like: sum(words in common+ images in common+...).

I want an algorithm that will help me find the closest 5, say, objects to Ox and (a different?) algorithm to find the closest 5 pairs of objects

I've looked into cluster analysis and maximal matching and they don't seem to exactly fit this scenario. I don't want to use these method if something more apt exists, so does this look like a particular type of algorithm to anyone, or can anyone point me in the right direction to applying the algorithms I mentioned to this?

Upvotes: 0

Views: 649

Answers (2)

PaulMag
PaulMag

Reputation: 4148

I made an example program for how to solve your first question. But you have to implement ho you want to compare images, audio and videos. And I assume every object has the same length for all lists. To answer your question number two it would be something similar, but with a double loop.

import numpy as np
from random import randint

class Thing:

    def __init__(self, words, images, audios, videos):
        self.words  = words
        self.images = images
        self.audios = audios
        self.videos = videos

    def compare(self, other):
        score = 0
        # Assuming the attribute lists have the same length for both objects
        # and that they are sorted in the same manner:
        for i in range(len(self.words)):
            if self.words[i] == other.words[i]:
                score += 1
        for i in range(len(self.images)):
            if self.images[i] == other.images[i]:
                score += 1
        # And so one for audio and video. You have to make sure you know
        # what method to use for determining when an image/audio/video are
        # equal.
        return score


N = 1000
things = []
words  = np.random.randint(5, size=(N,5))
images = np.random.randint(5, size=(N,5))
audios = np.random.randint(5, size=(N,5))
videos = np.random.randint(5, size=(N,5))
# For testing purposes I assign each attribute to a list (array) containing
# five random integers. I don't know how you actually intend to do it.
for i in xrange(N):
    things.append(Thing(words[i], images[i], audios[i], videos[i]))

# I will assume that object number 999 (i=999) is the Ox:
ox = 999
scores = np.zeros(N - 1)
for i in xrange(N - 1):
    scores[i] = (things[ox].compare(things[i]))

best = np.argmax(scores)
print "The most similar thing is thing number %d." % best
print
print "Ox attributes:"
print things[ox].words
print things[ox].images
print things[ox].audios
print things[ox].videos
print
print "Best match attributes:"
print things[ox].words
print things[ox].images
print things[ox].audios
print things[ox].videos

EDIT:

Now here is the same program modified sligthly to answer your second question. It turned out to be very simple. I basically just needed to add 4 lines:

  1. Changing scores into a (N,N) array instead of just (N).
  2. Adding for j in xrange(N): and thus creating a double loop.
  3. if i == j:
  4. break

where 3. and 4. is just to make sure that I only compare each pair of things once and not twice and don't compary any things with themselves.

Then there is a few more lines of code that is needed to extract the indices of the 5 largest values in scores. I also reformated the printing so it will be easy to confirm by eye that the printed pairs are actually very similar.

Here comes the new code:

import numpy as np

class Thing:

    def __init__(self, words, images, audios, videos):
        self.words  = words
        self.images = images
        self.audios = audios
        self.videos = videos

    def compare(self, other):
        score = 0
        # Assuming the attribute lists have the same length for both objects
        # and that they are sorted in the same manner:
        for i in range(len(self.words)):
            if self.words[i] == other.words[i]:
                score += 1
        for i in range(len(self.images)):
            if self.images[i] == other.images[i]:
                score += 1
        for i in range(len(self.audios)):
            if self.audios[i] == other.audios[i]:
                score += 1
        for i in range(len(self.videos)):
            if self.videos[i] == other.videos[i]:
                score += 1
        # You have to make sure you know what method to use for determining
        # when an image/audio/video are equal.
        return score


N = 1000
things = []
words  = np.random.randint(5, size=(N,5))
images = np.random.randint(5, size=(N,5))
audios = np.random.randint(5, size=(N,5))
videos = np.random.randint(5, size=(N,5))
# For testing purposes I assign each attribute to a list (array) containing
# five random integers. I don't know how you actually intend to do it.
for i in xrange(N):
    things.append(Thing(words[i], images[i], audios[i], videos[i]))


################################################################################
############################# This is the new part: ############################
################################################################################
scores = np.zeros((N, N))
# Scores will become a triangular matrix where scores[i, j]=value means that
# value is the number of attrributes thing[i] and thing[j] have in common.
for i in xrange(N):
    for j in xrange(N):
        if i == j:
            break
            # Break the loop here because:
            # * When i==j we would compare thing[i] with itself, and we don't
            #   want that.
            # * For every combination where j>i we would repeat all the
            #   comparisons for j<i and create duplicates. We don't want that.
        scores[i, j] = (things[i].compare(things[j]))

# I want the 5 most similar pairs:
n = 5
# This list will contain a tuple for each of the n most similar pairs:
best_list = []
for k in xrange(n):
    ij = np.argmax(scores) # Returns a single integer: ij = i*n + j
    i = ij / N
    j = ij % N
    best_list.append((i, j))
    # Erease this score so that on next iteration the second largest score
    # is found:
    scores[i, j] = 0

for k, (i, j) in enumerate(best_list):
    # The number 1 most similar pair is the BEST match of all.
    # The number N most similar pair is the WORST match of all.
    print "The number %d most similar pair is thing number %d and %d." \
          % (k+1, i, j)
    print "Thing%4d:" % i, \
          things[i].words, things[i].images, things[i].audios, things[i].videos
    print "Thing%4d:" % j, \
          things[j].words, things[j].images, things[j].audios, things[j].videos
    print

Upvotes: 1

Aaron Digulla
Aaron Digulla

Reputation: 328624

If your comparison works with "create a sum of all features and find those which the closest sum", there is a simple trick to get close objects:

  1. Put all objects into an array
  2. Calculate all the sums
  3. Sort the array by sum.

If you take any index, the objects close to it will now have a close index as well. So to find the 5 closest objects, you just need to look at index+5 to index-5 in the sorted array.

Upvotes: 1

Related Questions