Reputation: 506
I have a list of possible groups for a list of strings. Each string consists of several words, which are the string elements. I want to group the strings according to these elements.
Each group is based on a common word: all strings in the group must contain that word -- although I don't require that all strings containing that word be in the same group. A string with N words can be in any of N different groups. Each string may be in only one group. Each group must have at least two strings.
Goal: form the groups to maximize the quantity of strings that are in a group (minimize "orphan" strings).
For instance, if I have the following list of strings:
cycle cost
pump cost
cycle analysis
cost example
I would have all possible words of each string as potential groups. I now want to group these strings so that all, or as many as possible, get into a group.
I have tried the naive approach to take the group with the most strings in it first, which in this example would be cost
, but this leaves cycle analysis
without a group.
The result I'm looking for in this example is:
cycle: cycle cost, cycle analysis
cost: pump cost, cost example
Does there exist an algorithm for this sort of problem already? Any pointers on the approach to take would be helpful.
Upvotes: 0
Views: 145
Reputation: 77847
It looks like @m69 has a good lead. Your problem has a couple of modifications:
Unfortunately, this is NP-hard, at best. If the application's input is not ridiculously large, I'd use a brute-force heuristic with liberal backtracking.
Initialization:
Process:
You can gain some advantages in this by judicious ordering of the sets in S. I recommend a greedy algorithm, with value measured according to the desirability of its element in the sets of S. For instance, an element that appears in only one set would shove that set to the top of the list.
Does that get you started?
Upvotes: 2