Tim

Reputation: 7464

Combining columns to preserve uniqueness

I need to combine multiple columns into a single "grouping" variable, as in the "Paste multiple columns together" thread. The problem is that I want the result to be robust to strings with similar content, e.g.

tmp1 <- data.frame(V1 = c("a", "aa", "a",  "b", "bb", "aa"),
                   V2 = c("a", "a",  "aa", "b", "b",  "a"))

tmp2 <- data.frame(V1 = c("+",  "++", "+-", "-|",  "||"),
                   V2 = c("-|", "--", "++", "|-+", "|"))

For data like the above, apply(x, 1, paste, collapse = sep) with a common separator such as "", "|", "-" or "+" fails: the column boundaries can no longer be identified in the output, so different combinations of values may collapse into the same string.
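To make the failure concrete, here is a minimal sketch using the tmp1 data with an empty separator. The variable names pasted, rows_differ and collide are only for illustration.

```r
tmp1 <- data.frame(V1 = c("a", "aa", "a",  "b", "bb", "aa"),
                   V2 = c("a", "a",  "aa", "b", "b",  "a"),
                   stringsAsFactors = FALSE)

# Paste the two columns with an empty separator,
# equivalent to apply(tmp1, 1, paste, collapse = "")
pasted <- paste0(tmp1$V1, tmp1$V2)

# Rows 2 ("aa", "a") and 3 ("a", "aa") are different combinations...
rows_differ <- !identical(unlist(tmp1[2, ]), unlist(tmp1[3, ]))

# ...but both collapse to the string "aaa"
collide <- pasted[2] == pasted[3]
```

Both rows_differ and collide are TRUE here, so rows 2 and 3 would wrongly end up in the same group.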

The columns can be assumed to be of different types (numeric, factor, character etc.).

The expected output is a vector with one ID per row, where each ID corresponds to a unique combination of values across the two columns. The actual form of the IDs is not important to me. For example,

1 2 3 4 5 2

for the tmp1 data.

Can you suggest a better way to do this? Please note that performance is a concern.

Upvotes: 0

Views: 60

Answers (1)

Eric Watt

Reputation: 3230

Based on the update to your question, if the form of the ID doesn't matter, this is easy. Here is a method using data.table; you can do something similar with dplyr.

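library(data.table)

# tmp1 is a data.frame in the question; convert it to a data.table by reference
setDT(tmp1)

# Number the unique (V1, V2) rows, then join the IDs back onto the full table
merge(tmp1,
      unique(tmp1)[, .(V1, V2, ID = 1:.N)],
      by = c("V1", "V2"))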

   V1 V2 ID
1:  a  a  1
2:  a aa  3
3: aa  a  2
4: aa  a  2
5:  b  b  4
6: bb  b  5

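The second argument to merge keeps only the unique combinations and assigns an ID to each unique row; the merge then brings those IDs back to the full dataset. Note that merge sorts the result by the "by" columns by default, which is why the rows above are not in the same order as in tmp1.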
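If the original row order should be kept, one possible alternative (a sketch, not part of the method above) is to assign the group counter .GRP by reference. With "by", data.table numbers the groups in order of first appearance, and no join is needed.

```r
library(data.table)

tmp1 <- data.table(V1 = c("a", "aa", "a",  "b", "bb", "aa"),
                   V2 = c("a", "a",  "aa", "b", "b",  "a"))

# .GRP is the running group number; grouping with `by` keeps
# groups in order of first appearance and leaves rows in place
tmp1[, ID := .GRP, by = .(V1, V2)]

tmp1$ID
# [1] 1 2 3 4 5 2
```

Because grouping compares the column values themselves rather than pasted strings, this does not depend on picking a safe separator, and it works on columns of different types.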

Upvotes: 1
