Frederik Hansen
Frederik Hansen

Reputation: 506

Grouping of strings

I have a list of possible groups for a list of strings. Each string consists of several words, which are the string elements. I want to group the strings according to these elements.

Each group is based on a common word: all strings in the group must contain that word -- although I don't require that all strings containing that word be in the same group. A string with N words can be in any of N different groups. Each string may be in only one group. Each group must have at least two strings.

Goal: form the groups to maximize the quantity of strings that are in a group (minimize "orphan" strings).

For instance, if I have the following list of strings:

cycle cost
pump cost
cycle analysis
cost example

I would have all possible words of each string as potential groups. I now want to group these strings so that all, or as many as possible, get into a group.

I have tried the naive approach to take the group with the most strings in it first, which in this example would be cost, but this leaves cycle analysis without a group.

The result I'm looking for in this example is:

cycle: cycle cost, cycle analysis
cost: pump cost, cost example

Does there exist an algorithm for this sort of problem already? Any pointers on the approach to take would be helpful.

Upvotes: 0

Views: 145

Answers (1)

Prune
Prune

Reputation: 77847

It looks like @m69 has a good lead. Your problem has a couple of modifications:

  • Remove all sets of size 1;
  • When a set is (tentatively) added to the solution, all elements of that set must be removed from all remaining sets
    • ... and any sets with fewer than two elements get removed.

Unfortunately, this is NP-hard, at best. If the application's input is not ridiculously large, I'd use a brute-force heuristic with liberal backtracking.

Initialization:

  • A viable set is one with at least 2 elements.
  • Process your list.
    • Place all viable sets into a list S.
    • Place all elements into the universe, U.

Process:

  1. Pick a set P from S; add it to the solution list.
  2. Remove all elements of P from each remaining set in S.
  3. Remove all non-viable sets from S.
  4. If S is empty
    • Then If all elements of U exist in S,
      • Then stop and report solution.
      • Else return to previous call level
    • Else [S not empty] Recur on this process with the new S
  5. If recursion reported success,
    • Then return to previous call level.
    • Else go back to step 1, and choose the next set from S

You can gain some advantages in this by judicious ordering of the sets in S. I recommend a greedy algorithm, with value measured according to the desirability of its element in the sets of S. For instance, an element that appears in only one set would shove that set to the top of the list.

Does that get you started?

Upvotes: 2

Related Questions