user3120266

Reputation: 425

Clustering non-numeric groups

I am trying to group together parts of a data set that I am working with. I have a group of individuals (agents) who work with a variety of different skills. The idea is to get the largest percentage of agents and skills represented.

So in a perfect scenario, I would get a sample of agents that accounts for 85-90% of the records, along with a group of skills that also accounts for 85-90% of the records. Basically, I want the largest possible sample that excludes small groups of agents who work with only a few skills, as well as skills that only a very small percentage of agents work with.

I am trying to find a more statistical approach to doing this and thought about clustering. But from my understanding, clustering requires a distance definition, and I am not sure that this data would fit that requirement.

Below is a small sample of what the data looks like:

      Agent          Skill
        1            Claims
        1            Benefits
        2            Claims
        2              -
        3            Other

Upvotes: 0

Views: 249

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77454

You are looking at the wrong tools for this problem.

What you are trying to do is a variant of the set cover problem, not clustering.

Except that you are not looking for a minimal cover, but an approximate upper cover.

You'll need to decide when a solution is better than another. Your description of this is too vague - it allows the trivial solution of keeping everything: 100% cover.

Then repeatedly try to either:

  • remove an agent
  • remove a skill

depending on what yields the best improvement.

But again, you need to have a formal quality criterion.
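To make the remove-an-agent-or-a-skill loop above concrete, here is a minimal sketch in Python. The quality criterion is purely an assumption for illustration: `score` rewards the fraction of records kept and charges a made-up `penalty` for every agent and skill that stays in the solution. The names `coverage`, `score`, `greedy_prune` and the sample `records` are all hypothetical; swap in whatever criterion you actually settle on.

```python
def coverage(records, agents, skills):
    """Fraction of records whose agent AND skill are both kept."""
    kept = sum(1 for a, s in records if a in agents and s in skills)
    return kept / len(records)


def score(records, agents, skills, penalty):
    """Assumed criterion: coverage minus a fixed cost per kept agent/skill."""
    return coverage(records, agents, skills) - penalty * (len(agents) + len(skills))


def greedy_prune(records, penalty=0.12):
    """Start from the full (100%) cover and greedily drop agents or skills."""
    agents = {a for a, _ in records}
    skills = {s for _, s in records}
    best = score(records, agents, skills, penalty)
    while True:
        # Evaluate every single removal; sorting keeps tie-breaking deterministic.
        candidates = [
            (score(records, agents - {a}, skills, penalty), "agent", a)
            for a in sorted(agents)
        ] + [
            (score(records, agents, skills - {s}, penalty), "skill", s)
            for s in sorted(skills)
        ]
        if not candidates:
            break
        top, kind, item = max(candidates, key=lambda c: c[0])
        if top <= best:
            break  # no single removal improves the criterion
        best = top
        if kind == "agent":
            agents.discard(item)
        else:
            skills.discard(item)
    return agents, skills


# Made-up records as (agent, skill) pairs, loosely modelled on the question.
records = (
    [(1, "Claims")] * 4
    + [(2, "Claims")] * 4
    + [(3, "Other")]
    + [(1, "Benefits")]
)
agents, skills = greedy_prune(records)
print(sorted(agents), sorted(skills), coverage(records, agents, skills))
```

Note that a greedy search like this only looks one removal ahead. With a small penalty it can stop early, for example when dropping a rare agent only pays off once the skill that only that agent uses is dropped as well.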

Upvotes: 2
