Ben L
Ben L

Reputation: 1321

Is there an standard algorithm for finding the union of list items

I'm not sure what to call this algorithm so that's part of the reason I'm not not sure it if exists already.

(note: I'm using some macro's to simplify the input for illustration purposes)

std::vector<BYTE> itemlist;

itemlist.push_back( 1000 );
itemlist.push_back( 0001 );
itemlist.push_back( 0001 );
itemlist.push_back( 0010 );
itemlist.push_back( 0100 );
itemlist.push_back( 0011 );
itemlist.push_back( 0001 );
itemlist.push_back( 1000 );

std::list<std::vector<int>> results = Confab(itemlist);

expected results:

{ [ 1,2,3,5,6], [4], [0,7] }

So I'm using the BYTE as a bit mask to find the union of the members. The key is I'm also updating the mask as well. So items 1 and 2 have nothing in common, but item 5 merges with either.

I'm looking for any insights about how to do this in the least number of passes. In the real world I'm using DWORDs.

edit I forgot to mention that any item with 0000 means the item gets its own entry.

Upvotes: 3

Views: 306

Answers (2)

Bernhard Barker
Bernhard Barker

Reputation: 55619

Start off with an array of list pointers equal to the number of bits you have, consisting of null pointers to start off with.

For each element:

  • If none of the array elements corresponding to the bits are set, create a new list and set all the array elements to point to the list.

  • Otherwise, for each initialized array element, merge the lists, making the pointers point to the new list. Also set all uninitialized array elements to the new list.

  • Add an auto-incremented value to the new list.

Go through the array, outputting all unique lists (to only process unique lists, you can set some indicator once you've processed the list, so you don't process it twice).

Oh right, we also need union-find here - otherwise merging the lists may leave some other elements pointing to the old list.

A bit more on the union-find algorithm:

This algorithm uses a forest (a bunch of tree data structures), but instead of the standard tree format where parents point to children, the children instead point to their parent, and a parent can have any number of children. We start off with a tree containing just a single element for each set (so, in our case, whenever we create a new list, we'd create a new set). The root can be thought of an indicator of the set which this entire tree belongs to. To merge two sets, we make one of the roots point to the other root, so the trees would be combined under the other root.

In our case, we'd have each root also store a list of elements. Whenever we want to merge lists as specified above, we'd first iterate up each tree to see which set it belongs to, and merge the sets (and their lists) if they belong to different sets, or do nothing otherwise.

This is just a basic description union-find, you may need to read up on it some more if you're struggling to understand some of the complexities here.

Running time analysis:

You can merge lists in constant time (see list::splice). For 32 bits, there can never be more than 31 merges in total, as there can never be more sets, thus lists, than the number of bits, so you'd just mostly be doing checking.

The added complexity of the union-find algorithm should be negligible here, especially if you use some well-known optimizations.

This (if implemented correctly) is linear in the number of bits times the number of input elements, which you can't do better than (asymptotically speaking), as you at least need to look at every bit of every element in the input.

Example: (not showing union-find, for simplicity)

4 bits, so an array of 4 lists.

[NULL,NULL,NULL,NULL]

Process 1000. The first bit isn't set, so create a new list containing 0 and make that element point to it.

 [0]
  ^
  |
[   ,NULL,NULL,NULL]

Process 0001. The last bit isn't set, so create a new list containing 1 and make that element point to it.

 [0]           [1]
  ^             ^
  |             |
[   ,NULL,NULL,   ]

Process 0001. The last bit is set, so just add 2.

 [0]           [1,2]
  ^             ^
  |             |
[   ,NULL,NULL,   ]

Process 0010. The second-last bit isn't set, so create a new list containing 3 and make that element point to it.

 [0]      [3] [1,2]
  ^        ^   ^
  |        |   |
[   ,NULL,   ,   ]

Process 0100. The second bit isn't set, so create a new list containing 4 and make that element point to it.

 [0] [4] [3] [1,2]
  ^   ^   ^   ^
  |   |   |   |
[   ,   ,   ,   ]

Process 0011. The second-last and last bits are set, so merge them and add 5.

 [0] [4] [1,2,3,5]
  ^   ^   ^   ^
  |   |   |   |
[   ,   ,   ,   ]

Process 0001. The last bit is set, so just add 6.

 [0] [4] [1,2,3,5,6]
  ^   ^   ^   ^
  |   |   |   |
[   ,   ,   ,   ]

Process 1000. The first bit is set, so just add 7.

 [0,7] [4] [1,2,3,5,6]
  ^     ^   ^   ^
  |     |   |   |
[   ,     ,   ,   ]

Now we have our 3 lists, as required.

Upvotes: 2

Łukasz Kidziński
Łukasz Kidziński

Reputation: 1623

A 'disjoint-set' is quite a general approach (http://en.wikipedia.org/wiki/Disjoint-set_data_structure). Some adaptation can be used here as well.

One possible approach:

  • Consider a list A = {1000, 0100, 0010, 0001} - each element represents a set and at the beginning all sets are disjoint.
  • For each element v in your list (like 1010) find corresponding sets in A (all e in A s.t. e & v != 0).
  • Join those elements e1,e2,...: Remove them from A and add e1|e2|... to A.

Finally, A represents all sets. You can classify each item e by finding an a in A s.t. a & e == e.

Upvotes: 0

Related Questions