dvs23

Reputation: 137

Create sets of minimal cardinality from set of pairs

I have a set of pairs of IDs like

(123;1765)
(1212;8977)...

I need to separate those pairs into n groups, each with an individual size (number of pairs). Those groups should have minimal cardinality (i.e. as few distinct IDs as possible in each group). Are there any existing algorithms that solve this problem? I'm not sure where/how to search for one. This is necessary because I'm currently working on the load balancing of one of my projects, and each node should have to load as few IDs as possible because of limited RAM (each ID is connected to a larger dataset).

Edit:
Some background: different nodes in a cluster have to compare datasets identified by IDs. Each comparison is a pair of IDs (compare the dataset of ID1 with the dataset of ID2). Each node gets a bunch of pairs so it knows which IDs it has to compare, and loads the corresponding datasets into RAM. A master node divides a big bunch of pairs into smaller bunches and distributes them to the slave nodes. Because each node can only store a limited number of datasets, those smaller bunches need to contain as few distinct IDs as possible. But the nodes have different amounts of RAM, so the groups with minimal cardinality should have different sizes. The comparison is symmetric, so compare(ID1, ID2) is the same as compare(ID2, ID1); hence each pair is unique. Which datasets need to be compared is defined by a client, which sends those jobs to the master as a bunch of pairs of IDs.

An example: a client wants the comparisons (1;2), (7;9), (9;105), (7;105), (2;4), (4;1) (usually there would be many more comparisons, millions typically). The client sends those pairs to the master, which has two registered slaves. Now the master needs to divide that stack of work into two groups, but the more distinct IDs each group contains, the more datasets the slaves need to load (each ID corresponds to a specific dataset, remember?).

So ideally the master would create a group like ((1;2), (2;4), (4;1)) (it only contains 3 distinct IDs, so the slave only has to load 3 datasets) and ((7;9), (9;105), (7;105)) (again just three IDs), instead of, say, ((1;2), (9;105)...) and ((2;4), (7;105)...). With the latter split, both slaves need to load 4 or more IDs, and, for example, both slaves need to load datasets no. 2 and 105. This needs to be optimized somehow.
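To make the goal measurable: the cost of a grouping is the number of distinct IDs per group. Here is a minimal Python sketch that computes this for the example above (the "bad" split is one possible completion of the one hinted at):

```python
# Count how many distinct IDs (= datasets to load) each group requires.
def group_cardinalities(groups):
    return [len({i for pair in group for i in pair}) for group in groups]

good = [[(1, 2), (2, 4), (4, 1)], [(7, 9), (9, 105), (7, 105)]]
bad = [[(1, 2), (9, 105), (7, 9)], [(2, 4), (7, 105), (4, 1)]]

print(group_cardinalities(good))  # [3, 3] -> each slave loads 3 datasets
print(group_cardinalities(bad))   # [5, 5] -> both slaves load 5 datasets
```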

Upvotes: 2

Views: 155

Answers (1)

pwilcox

Reputation: 5753

My first instinct is to say that perhaps this could be resolved with a special cluster analysis where you customize the aggregation and distance functions.

  • The cluster members would be pairs.
  • The cluster aggregate would be the set-theoretical union of all pairs in the cluster (this is instead of an average or median in the standard approach).
  • The distance function of any pair in comparison to the cluster would be the number of elements in the pair that are not found in the cluster aggregate (so the cardinality of the set difference; this replaces the Euclidean distance in the standard approach).
  • Some cluster algorithms have you set the number of desired clusters in advance, so you would set it to two.
  • And finally, because you need to balance things so that the clusters end up with the required number of pairs, some further tweaking is needed, but it's still doable (see the sketch after this list).
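A minimal Python sketch of that idea (my own greedy interpretation, not a full cluster algorithm: it fills one cluster at a time using the union aggregate and set-difference distance described above, with the per-group pair counts passed in):

```python
def cluster_pairs(pairs, group_sizes):
    """Greedy sketch: fill one cluster at a time. The cluster aggregate is
    the union of all IDs in the cluster; the distance of a pair to a cluster
    is the number of its IDs missing from the aggregate (set difference)."""
    remaining = list(pairs)
    clusters = []
    for size in group_sizes:
        ids, members = set(), []
        for _ in range(size):
            # pick the remaining pair that adds the fewest new IDs
            pair = min(remaining, key=lambda p: len(set(p) - ids))
            remaining.remove(pair)
            members.append(pair)
            ids.update(pair)
        clusters.append((members, ids))
    return clusters

pairs = [(1, 2), (7, 9), (9, 105), (7, 105), (2, 4), (4, 1)]
for members, ids in cluster_pairs(pairs, [3, 3]):
    print(members, "->", len(ids), "distinct IDs")
# [(1, 2), (2, 4), (4, 1)] -> 3 distinct IDs
# [(7, 9), (9, 105), (7, 105)] -> 3 distinct IDs
```

On the question's example this reproduces the ideal split, but as a greedy pass it is quadratic in the number of pairs and order-dependent, which leads directly to the scaling concern below.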

But you say you will have millions of pairs to compare. The processing required for cluster analysis grows rapidly with the input size (the greedy sketch above is already quadratic, and many clustering algorithms are worse). In this situation, it is worth researching whether your problem is NP-hard. I'm not well versed in that, but I suspect it is, in which case a true optimum will always escape you.

But if you discover that your problem is in fact NP-hard, you can still optimize; you just won't be able to guarantee arriving at the global optimum in a reasonable amount of time. So, for instance, you can break your set of pairs into subsets and run an algorithm such as the one above on each subset. That may still be an improvement.
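For instance, a sketch of that chunked approach, reusing the cluster_pairs sketch above (the chunk size and the proportional split per chunk are my own assumptions):

```python
def cluster_in_chunks(pairs, group_sizes, chunk_size=10_000):
    """Run the greedy clustering on fixed-size chunks of the input
    instead of the whole set, trading optimality for tractability."""
    total = sum(group_sizes)
    clusters = [([], set()) for _ in group_sizes]
    for start in range(0, len(pairs), chunk_size):
        chunk = pairs[start:start + chunk_size]
        # split each chunk proportionally to the requested group sizes
        sizes = [len(chunk) * s // total for s in group_sizes]
        sizes[0] += len(chunk) - sum(sizes)  # rounding remainder to group 0
        for (members, ids), (m, i) in zip(clusters, cluster_pairs(chunk, sizes)):
            members.extend(m)
            ids.update(i)
    return clusters
```

Each chunk is clustered independently, so the quadratic cost applies per chunk rather than to the whole input; the price is that IDs shared across chunks may end up in several groups.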

Upvotes: 2
