Reputation: 61

Group categorical values

Let's say I have three columns, like these:

c1    c2    c3
red   one   low
blue  two   high
red   two   low
red   one   low

I'm trying to figure out an efficient way to create new columns that encode records by similarity. The result of this would be something like this:

c1    c2    c3    d1
red   one   low   1 
blue  two   high
red   two   low 
red   one   low   1

With this generic example, I could vary the number of columns included, to also determine which subsets of columns are alike between records. My guess is that this may involve a k-mode clustering algorithm, though I haven't figured out how to get output like that above, which of course can readily be used as regressors.

Update

From the suggestions so far, the interactions command is by far the most elegant (one line!) and also the most parsimonious because it encodes all the groups numerically in a single column. Kudos to thelatemail for that suggestion, and thanks to everyone else for your other approaches.

Let me make this more challenging, and actually closer to what I need. Taking the example above, I could also use partial matching. What if two columns agree? The output might look (sort of) like this

c1    c2    c3    g1   g2
red   one   low   1    2
blue  two   high
red   two   low   2
red   one   low   1    2

Along with weights for the groups:

group   weight
1       3
2       2

In this example, group 1 has weight 3 because all factors agreed, while group 2 has weight 2 because only 2 out of 3 factors agreed.

The format of the returned groups and weights for the groups obviously might look a bit different.

Upvotes: 1

Answers (3)

thelatemail

Reputation: 93938

Maybe there's more complexity I'm missing, but you could just use ?interaction to label the matching records:

dat$group <- as.numeric(interaction(dat,drop=TRUE))
#    c1  c2   c3 group
#1  red one  low     2
#2 blue two high     1
#3  red two  low     3
#4  red one  low     2

Upvotes: 1

StefanK

Reputation: 2180

I like the expand grid answer by rsoren, but there are higher numbers in the code column than there are number of rows of original data.

If you want to start with 1 as the lowest unique identifier and add numbers consecutively, you need to adjust it a little.

This is your table

original_data <- read.table(
  header = TRUE, 
  text = "
  c1    c2    c3
  red   one   low
  blue  two   high
  red   two   low
  red   one   low")

specify which rows you will use, for example the first two:

unique_data <- unique(original_data[,1:2])
unique_data$ID <- 1:nrow(unique_data)

And merge the tables in the end:

final_data <- merge(original_data, unique_data)

Upvotes: 2

rsoren

Reputation: 4216

Here's one way. Make a new data frame with each unique combination of the variables, give each combination a code, and merge it back on to the original

tmp <- read.table(
  header = TRUE, 
  text = "
  c1    c2    c3
  red   one   low
  blue  two   high
  red   two   low
  red   one   low")

tmp2 <- expand.grid(
  c1 = unique(tmp$c1),
  c2 = unique(tmp$c2),
  c3 = unique(tmp$c3)
)

tmp2$code <- 1:nrow(tmp2)

merge(tmp, tmp2, all.x = TRUE)

Upvotes: 1

Group categorical values

Answers (3)

Related Questions