anjruu

Reputation: 1264

Algorithm for Minimizing Number of Large Loads

I have an algorithm where I need to compute all pair-wise distances between large objects (where the distance is a true metric, so dist(a,b) == dist(b,a), among all the other metric requirements). I have around a thousand objects, so I need to calculate about 500,000 distances. There are a few issues:

  1. All of these 1000 objects are serialized, sitting in individual files on the disk.
  2. Reading them from disk is a huge operation compared to the relatively trivial distance calculation.
  3. I can't have all of them in memory at the same time without swapping, and then thrashing. I can fit about 500 in memory at a time.
  4. This means that at some point I will need to re-read an object into memory after having read it and released its memory earlier.

So given that reading from disk is the bottleneck, and that I can't read in more than half at a time, can anyone think of an algorithm for reading and releasing these objects which minimizes the total number of reads?

I considered reading in the first half, doing all of those pair-wise distances, releasing all of that memory and reading the second half, and then doing all of those pair-wise distances. At this point, I still need the distances between the objects in the first half and the objects in the second, and I'm not sure what to do. I also considered having a cache that, when full, randomly chooses a resident object to evict and reads the next one, but I feel that there has to be a better option. I considered something like LRU, but it can lead to behavior where the object needed next is exactly the one that was evicted on the last calculation.

All-in-all, I'm kinda stuck. Any advice?

Upvotes: 0

Views: 91

Answers (3)

David Eisenstat

Reputation: 65458

Load the points in the first half. Compute pairwise distances. Keeping the first half in memory, load the points in the second half one by one, computing distances. Load the entire second half and compute pairwise distances. This is about 1.5 reads per point, which is about optimal in the following model.
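
Here is a minimal sketch of that schedule in Python, with hypothetical load(i) and dist(a, b) helpers standing in for your deserialization and metric (note the streaming pass needs one spare memory slot beyond the first half):

    def load(i):
        """Hypothetical: deserialize object i from its file on disk."""
        raise NotImplementedError

    def dist(a, b):
        """Hypothetical: the symmetric metric on two loaded objects."""
        raise NotImplementedError

    def all_pair_distances(n=1000, half=500):
        d = {}
        first = [load(i) for i in range(half)]        # 500 reads
        for i in range(half):                         # pairs within the first half
            for j in range(i + 1, half):
                d[i, j] = dist(first[i], first[j])
        for j in range(half, n):                      # 500 reads, one point at a time
            obj = load(j)
            for i in range(half):                     # cross pairs, then drop obj
                d[i, j] = dist(first[i], obj)
        del first                                     # release the first half
        second = [load(j) for j in range(half, n)]    # 500 re-reads
        for i in range(half, n):                      # pairs within the second half
            for j in range(i + 1, n):
                d[i, j] = dist(second[i - half], second[j - half])
        return d                                      # 1500 loads total, 1.5 per point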

The model is that a syntactically correct program for this task is a sequence of instructions of the form LOAD(i, j), whose meaning is to load point i into memory location j. The program is correct if, for every pair of points, there exists an instruction immediately after which both points are in memory.

It is clear that, in a correct program, every point is loaded at least once. Assuming exactly 1000 points and 500 memory locations, it follows that there are at least 500 evictions in favor of points being loaded for the first time. Assume without loss of generality that no point is ever duplicated in memory. After a point p is evicted in favor of a point q being loaded for the first time, it is necessary to reload p in order for the distance between p and q to be computed. Therefore, there are at least 500 reloads and hence at least 1500 loads.

Upvotes: 3

Seblis

Reputation: 337

I'll try to slightly improve on @Dukeling's answer. If I understand correctly, this schedule is better:

i j
1 2
  3
  4
2 
3
  2

Group 4 stays resident after the first pass, so loading group 2 as i = 2 needs no extra j-reads (the 2-4 pairs are computed on the spot), the 3-4 pairs are computed when group 3 replaces group 2, and only group 2 has to be read a third time for the final 2-3 pairs. And now you have (1+3+2+1)/4 = 1.75 reads per point.
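
Spelling the schedule out as an explicit trace (a sketch; the group numbers and the two-slot memory model follow the table above):

    # Each step: (group loaded, group it replaces or None, distances computed
    # while the two memory slots hold these groups of 250 points).
    steps = [
        (1, None, ["1-1"]),
        (2, None, ["1-2"]),
        (3, 2,    ["1-3"]),
        (4, 3,    ["1-4", "4-4"]),  # do intra-4 now; group 4 is never reloaded
        (2, 1,    ["2-4", "2-2"]),  # group 4 stays resident: i = 2 needs no j-read
        (3, 2,    ["3-4", "3-3"]),
        (2, 4,    ["2-3"]),         # the one extra reload of group 2
    ]

    assert len(steps) == 7  # 7 group reads of 250 points = 1750 reads, 1.75 per point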

Upvotes: 1

Bernhard Barker

Reputation: 55609

My first thought (so it might not be the best one):

For each i:

  • Read the i-th group of 250. Calculate the distances between these points.

  • For each remaining group j of 250:

    • Read the j-th group.

    • Calculate distances between the points in the i-th and the j-th groups.

    • Discard group j.

  • Discard group i.

So you'd read the following groups:

i  j
1  2
   3
   4
2  3
   4
3  4
4

You'll notice that reading group 4 by itself is rather pointless - you can just calculate its internal distances while it's already in memory during the pass for the 3rd (or some other) group.

So you end up with an average of (1+2+3+3)/4 = 2.25 reads per point.
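
As a rough Python sketch, with the group-4 fix folded in and hypothetical load_group(g) and dist(a, b) helpers standing in for the real I/O and metric:

    def load_group(g):
        """Hypothetical: deserialize the 250 objects of group g from disk."""
        raise NotImplementedError

    def dist(a, b):
        """Hypothetical: the symmetric metric on two loaded objects."""
        raise NotImplementedError

    def intra_distances(group):
        # All pairs within one resident group.
        return {(a, b): dist(a, b) for k, a in enumerate(group) for b in group[k + 1:]}

    def cross_distances(gi, gj):
        # All pairs with one point in each of two resident groups.
        return {(a, b): dist(a, b) for a in gi for b in gj}

    def grouped_pairs(n_groups=4):
        results = {}
        for i in range(1, n_groups):          # no standalone pass for the last group
            gi = load_group(i)
            results.update(intra_distances(gi))
            for j in range(i + 1, n_groups + 1):
                gj = load_group(j)
                results.update(cross_distances(gi, gj))
                if i == n_groups - 1:         # last pass: do intra-4 while it's here
                    results.update(intra_distances(gj))
                del gj                        # discard group j
            del gi                            # discard group i
        return results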

Upvotes: 2
