Reputation: 7464
I need to combine multiple columns into a single "grouping" variable, as in the "Paste multiple columns together" thread. The catch is that the result has to be robust to strings with similar content, e.g.
tmp1 <- data.frame(V1 = c("a", "aa", "a", "b", "bb", "aa"),
                   V2 = c("a", "a", "aa", "b", "b", "a"))
tmp2 <- data.frame(V1 = c("+", "++", "+-", "-|", "||"),
                   V2 = c("-|", "--", "++", "|-+", "|"))
For data like the above, using apply(x, 1, paste, collapse = sep) with common separators such as "", "|", "-", or "+" would fail: the column boundaries become unidentifiable in the output, so different combinations of values can end up mixed together.
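To make the problem concrete, here is a minimal sketch using tmp1 and the empty separator. The variable names key_empty, n_true and n_pasted are only illustrative.

```r
tmp1 <- data.frame(V1 = c("a", "aa", "a", "b", "bb", "aa"),
                   V2 = c("a", "a", "aa", "b", "b", "a"),
                   stringsAsFactors = FALSE)

# Paste the two columns of each row together with an empty separator.
key_empty <- unname(apply(tmp1, 1, paste, collapse = ""))

# Rows 2 ("aa", "a") and 3 ("a", "aa") are different combinations,
# yet both collapse to the same string "aaa".
collide <- key_empty[2] == key_empty[3]

n_true   <- nrow(unique(tmp1))        # distinct (V1, V2) combinations
n_pasted <- length(unique(key_empty)) # distinct pasted keys

print(c(n_true = n_true, n_pasted = n_pasted))
```

The pasted keys yield fewer groups than there are distinct rows, which is exactly the mixing described above.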
The columns can be assumed to be of different types (numeric, factor, character etc.).
The expected output is a vector with one ID per row, where each ID corresponds to a unique combination of values across the two columns. The actual form of the IDs is not important to me. For example,
1 2 3 4 5 2
for the tmp1 data.
Can you suggest a better way to do this? Please note that performance matters to me.
Upvotes: 0
Views: 60
Reputation: 3230
Based on the update to your question, if the form of the ID doesn't matter, this is easy. Here is a method using data.table; you can do something similar with dplyr.
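library(data.table)
setDT(tmp1)  # convert the data.frame to a data.table by reference; the [, .()] syntax needs it
merge(tmp1,
      unique(tmp1)[, .(V1, V2, ID = 1:.N)],  # one row and one ID per unique (V1, V2) pair
      by = c("V1", "V2"))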
   V1 V2 ID
1:  a  a  1
2:  a aa  3
3: aa  a  2
4: aa  a  2
5:  b  b  4
6: bb  b  5
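The second argument to merge keeps only the unique combinations of V1 and V2 and assigns a distinct ID to each of them; the merge then joins those IDs back onto the full dataset. Note that merge sorts the result by the key columns by default, so the rows above are not in their original order.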
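If the original row order needs to be preserved (merge sorts its result by the key columns by default), one possible alternative is data.table's .GRP symbol, which numbers groups as they are encountered. This is a sketch rather than part of the answer above; the names dt and ids are only illustrative.

```r
library(data.table)

dt <- data.table(V1 = c("a", "aa", "a", "b", "bb", "aa"),
                 V2 = c("a", "a", "aa", "b", "b", "a"))

# .GRP is an integer group counter. With by = (not keyby =), groups are
# numbered in order of first appearance and := writes the ID back to each
# row in place, so the row order is left untouched.
dt[, ID := .GRP, by = .(V1, V2)]

ids <- dt$ID
print(ids)  # 1 2 3 4 5 2
```

Because the grouping is done on the column values themselves, no separator is involved, so strings such as "a"/"aa" or "+"/"-|" cannot be confused with each other.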
Upvotes: 1