Reputation: 425
I am trying to group together parts of a data set that I am working with. I have a group of individuals (agents) who work with a variety of different skills. The idea is to get the largest percentage of agents and skills represented.
So in a perfect scenario, it would be nice to get a sample of agents that comprises 85-90% of the records, along with a group of skills that also represents 85-90% of the records. Basically, I want to obtain the largest possible sample without including small groups of agents that work with only a few skills, or skills that only a very small percentage of agents work with.
I am trying to find a more statistical approach to doing this and thought about clustering. But from my understanding, clustering requires a distance definition. I am not sure that this data would fit this requirement.
Below is a small sample of what the data looks like:
Agent Skill
1 Claims
1 Benefits
2 Claims
2 -
3 Other
Upvotes: 0
Views: 249
Reputation: 77454
You are looking at the wrong tools for this problem.
What you are trying to do is a variant of the set cover problem, not clustering.
Except that you are not looking for a minimal cover, but for an approximate upper cover.
You'll need to decide when a solution is better than another. Your description of this is too vague - it allows the trivial solution of keeping everything: 100% cover.
Then repeatedly try to either:
depending on what yields the best improvement.
But again, you need to have a formal quality criterion.
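To make this concrete, here is a minimal sketch of such a greedy loop in Python. Everything in it is an assumption for illustration, not something given in the question: the two moves are taken to be "drop one agent" or "drop one skill", the quality criterion is a made-up score (fraction of records kept minus a penalty per kept agent or skill), and `min_coverage`, `penalty`, and the sample records are hypothetical. Swap in whatever criterion you settle on.

```python
def coverage(records, agents, skills):
    """Fraction of records whose agent and skill are both kept."""
    kept = sum(1 for a, s in records if a in agents and s in skills)
    return kept / len(records)  # assumes records is non-empty


def score(records, agents, skills, penalty):
    """Hypothetical quality criterion: reward coverage, penalize group size."""
    return coverage(records, agents, skills) - penalty * (len(agents) + len(skills))


def greedy_prune(records, min_coverage=0.85, penalty=0.06):
    """Start from the trivial 100% cover, then repeatedly drop the single
    agent or skill that improves the score most, never going below
    min_coverage. This is a greedy heuristic, not an exact optimizer."""
    agents = {a for a, _ in records}
    skills = {s for _, s in records}
    best = score(records, agents, skills, penalty)
    while True:
        # Candidate moves: drop one agent, or drop one skill.
        moves = [(agents - {a}, skills) for a in sorted(agents, key=str)]
        moves += [(agents, skills - {s}) for s in sorted(skills, key=str)]
        feasible = [
            (score(records, A, S, penalty), A, S)
            for A, S in moves
            if coverage(records, A, S) >= min_coverage
        ]
        if not feasible:
            break
        top = max(feasible, key=lambda t: t[0])
        if top[0] <= best:  # no move improves the criterion: stop
            break
        best, agents, skills = top
    return agents, skills


# Hypothetical sample: (agent, skill) pairs, one per record.
records = (
    [(1, "Claims")] * 6 + [(1, "Benefits")] * 4
    + [(2, "Claims")] * 5 + [(2, "Benefits")] * 4
    + [(3, "Other")]
)
agents, skills = greedy_prune(records)
print(sorted(agents), sorted(skills), coverage(records, agents, skills))
```

With this sample, the rare agent 3 and the rare skill "Other" are the only things that can be dropped without falling below the coverage floor. Note that a greedy search like this can get stuck in a local optimum, and the result depends entirely on how you define the score.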
Upvotes: 2