Reputation: 61
Let's say I have three columns, like these:
c1   c2  c3
red  one low
blue two high
red  two low
red  one low
I'm trying to figure out an efficient way to create new columns that encode records by similarity. The result would be something like this:
c1   c2  c3   d1
red  one low  1
blue two high
red  two low
red  one low  1
With this generic example, I could vary the number of columns included, to also determine which subsets of columns are alike between records. My guess is that this may involve a k-mode clustering algorithm, though I haven't figured out how to get output like that above, which of course can readily be used as regressors.
Update
From the suggestions so far, the interaction() approach is by far the most elegant (one line!) and also the most parsimonious, because it encodes all the groups numerically in a single column. Kudos to thelatemail for that suggestion, and thanks to everyone else for your other approaches.
Let me make this more challenging, and actually closer to what I need. Taking the example above, I could also use partial matching. What if only two of the columns agree? The output might look (sort of) like this:
c1   c2  c3   g1 g2
red  one low  1  2
blue two high
red  two low     2
red  one low  1  2
Along with weights for the groups:
group weight
1     3
2     2
In this example, group 1 has weight 3 because all three factors agreed, while group 2 has weight 2 because only two of the three factors (c1 and c3) agreed.
The format of the returned groups and weights for the groups obviously might look a bit different.
Upvotes: 1
Views: 3402
Reputation: 93938
Maybe there's more complexity I'm missing, but you could just use ?interaction
to label the matching records:
# dat is the question's example data frame
dat$group <- as.numeric(interaction(dat, drop = TRUE))

#    c1  c2   c3 group
#1  red one  low     2
#2 blue two high     1
#3  red two  low     3
#4  red one  low     2
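The same idea might also be stretched toward the partial-matching case in the update. What follows is only a rough sketch, not part of the original answer: it applies interaction() to every subset of two or more columns, keeps a group label only where at least two rows share it, and takes the subset size as that grouping's weight. The new column names (c1_c2 and so on) are made up here purely for illustration.

dat <- read.table(header = TRUE, text = "
c1 c2 c3
red one low
blue two high
red two low
red one low")

cols <- names(dat)

# every subset of two or more of the original columns
subsets <- unlist(lapply(2:length(cols), function(k)
  combn(cols, k, simplify = FALSE)), recursive = FALSE)

for (s in subsets) {
  g <- as.numeric(interaction(dat[s], drop = TRUE))
  g[ave(g, g, FUN = length) == 1] <- NA    # blank out groups that only one row belongs to
  dat[[paste(s, collapse = "_")]] <- g     # weight of this grouping = length(s)
}
dat

Each added column then corresponds to one subset of the original columns, and its weight in the update's sense is simply the number of columns in that subset.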
Upvotes: 1
Reputation: 2180
I like the expand.grid answer by rsoren, but the code column ends up with higher numbers than there are rows in the original data.
If you want the lowest identifier to be 1 and the identifiers to run consecutively, you need to adjust it a little.
This is your table:
original_data <- read.table(
  header = TRUE,
  text = "
c1 c2 c3
red one low
blue two high
red two low
red one low")
Specify which columns you will use, for example the first two:
unique_data <- unique(original_data[, 1:2])   # unique combinations of the chosen columns
unique_data$ID <- 1:nrow(unique_data)         # consecutive IDs starting at 1
Finally, merge the tables:
final_data <- merge(original_data, unique_data)
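If the set of columns needs to vary, as the question mentions, the same unique()/merge() idea could be wrapped in a small helper. This is just a sketch, and group_by_cols is a made-up name:

group_by_cols <- function(df, cols) {
  u <- unique(df[, cols, drop = FALSE])
  u$ID <- seq_len(nrow(u))   # consecutive IDs starting at 1
  merge(df, u, by = cols)    # attach each row's group ID
}

group_by_cols(original_data, c("c1", "c2"))          # group on the first two columns
group_by_cols(original_data, names(original_data))   # group on all three columns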
Upvotes: 2
Reputation: 4216
Here's one way: make a new data frame with each unique combination of the variables, give each combination a code, and merge it back onto the original:
tmp <- read.table(
  header = TRUE,
  text = "
c1 c2 c3
red one low
blue two high
red two low
red one low")

tmp2 <- expand.grid(
  c1 = unique(tmp$c1),
  c2 = unique(tmp$c2),
  c3 = unique(tmp$c3)
)
tmp2$code <- 1:nrow(tmp2)

merge(tmp, tmp2, all.x = TRUE)
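If the gaps in the code column are a problem (see the point above about codes running higher than the number of rows), one possible fix is to remap the codes to consecutive integers after the merge, for example:

res <- merge(tmp, tmp2, all.x = TRUE)
res$code <- match(res$code, sort(unique(res$code)))   # remap to 1, 2, 3, ...
res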
Upvotes: 1