Reputation: 23
I have a data frame with thousands (~12k) of records and multiple (36) variables. To provide some context, these are observations of animal behaviour. Twenty of the variables record the presence or absence of particular individuals (as factors with 2 levels).
What I want to achieve is the creation of an additional variable that labels each unique combination of the presence/absence variables. So, if I have the following simplified dataset, where X & Y record variables of interest (details are not relevant to the question) and three individuals A, B, C (where "ab" = absent and "pr" = present):
Date Time X Y A B C
17-12-01 10:15 2 4 pr ab pr
17-12-01 10:45 3 5 pr ab pr
17-12-01 11:15 2 4 ab pr pr
17-12-01 11:40 2 4 ab pr pr
17-12-01 11:15 6 7 ab pr ab
...
The possible unique combinations would be:
A B C
pr pr pr
pr ab ab
ab pr ab
ab ab pr
pr pr ab
ab pr pr
pr ab pr
[edit: the combination of all 'ab' is - in principle - possible, but would not appear as all individuals would be absent - and thus no observation could be made]
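As a quick check on the list above, the patterns can be enumerated in base R. This is only a sketch: the names levels_pa, combos and observable are made up for illustration and are not part of the original data.

```r
# Enumerate every presence/absence pattern for three individuals (base R only).
# levels_pa, combos and observable are illustrative names, not from the original post.
levels_pa <- c("pr", "ab")
combos <- expand.grid(A = levels_pa, B = levels_pa, C = levels_pa,
                      stringsAsFactors = FALSE)

# Drop the all-absent pattern, which can never be observed.
all_absent <- combos$A == "ab" & combos$B == "ab" & combos$C == "ab"
observable <- combos[!all_absent, ]

nrow(combos)      # 2^3 patterns in total
nrow(observable)  # 2^3 - 1 observable patterns
```

With 20 individuals the same count is 2^20 - 1 = 1,048,575, far more than the ~12k records, so only the combinations that actually occur in the data need a label.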
[edit:] desired output:
Date Time X Y A B C Combination
17-12-01 10:15 2 4 pr ab pr 1
17-12-01 10:45 3 5 pr ab pr 1
17-12-01 11:15 2 4 ab pr pr 2
17-12-01 11:40 2 4 ab pr pr 2
... ab ab pr 3 etc.
I have found that I can create a new data frame selecting the unique combinations using distinct from dplyr:
newDF = distinct(oldDF, A, B, C, .keep_all = TRUE)
but this deletes (or rather fails to include in newDF) the duplicates, whereas what I'd like to do is keep all duplicates and just label each unique combination. All other posts and advice I can find are concerned with deleting duplicates, and the real data frame has 20 'individual' variables, not just three.
So my question is: is there a way to use the output from distinct to achieve this, or have I led myself down a blind alley and another approach is needed?
Any help much appreciated.
Nick
Upvotes: 2
Views: 1087
Reputation: 42544
The OP has requested to create an additional variable that labels each unique combination of the presence/absence variables. He has supplied a sample dataset with three different variables and a table of possible combinations of the values of the three variables.
To have distinct numbers for the 2^3 - 1 = 7 different combinations, we can use:
dt[, Combination := .GRP, by = .(A, B, C)][]
which numbers the combinations in order of appearance (.GRP is a special symbol in data.table syntax):
Date Time X Y A B C Combination
1: 17-12-01 10:15 2 4 pr ab pr 1
2: 17-12-01 10:45 3 5 pr ab pr 1
3: 17-12-01 11:15 2 4 ab pr pr 2
4: 17-12-01 11:40 2 4 ab pr pr 2
5: 17-12-01 11:15 6 7 ab pr ab 3
6: 17-12-01 11:45 3 4 pr ab pr 1
Note that the sample dataset provided by the OP has been modified to contain a 6th row which has the same Combination number as rows 1 and 2.
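For the real data with 20 individual columns, typing every column into by = .(...) gets tedious. A minimal, self-contained sketch of the same .GRP idea with the grouping columns held in a character vector might look like this; ind_cols is a hypothetical name, and the six-row sample is the one shown above.

```r
library(data.table)

# Six-row sample from above, read the same way as elsewhere in this thread.
dt <- setDT(read.table(text = "Date Time X Y A B C
17-12-01 10:15 2 4 pr ab pr
17-12-01 10:45 3 5 pr ab pr
17-12-01 11:15 2 4 ab pr pr
17-12-01 11:40 2 4 ab pr pr
17-12-01 11:15 6 7 ab pr ab
17-12-01 11:45 3 4 pr ab pr", header = TRUE))

# Hypothetical vector of presence/absence column names;
# in the real data this would hold the 20 individual columns.
ind_cols <- c("A", "B", "C")

# by also accepts a character vector of column names.
dt[, Combination := .GRP, by = ind_cols]
print(dt)
```

Because .GRP counts groups in order of first appearance, the labels depend on row order; sorting the table beforehand would change which combination gets which number.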
Upvotes: 2
Reputation: 4768
One approach with dplyr:
library(dplyr)
df %>%
  mutate(id = group_indices(., A, B, C))  # ids follow the sorted group keys, not order of appearance
Date Time X Y A B C id
1 17-12-01 10:15 2 4 pr ab pr 3
2 17-12-01 10:45 3 5 pr ab pr 3
3 17-12-01 11:15 2 4 ab pr pr 2
4 17-12-01 11:40 2 4 ab pr pr 2
5 17-12-01 11:15 6 7 ab pr ab 1
6 17-12-01 11:45 3 4 pr ab pr 3
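Since dplyr 1.0.0, passing grouping columns directly to group_indices() is deprecated in favour of grouping first. A sketch of the equivalent using group_by() and cur_group_id() follows; ind_cols and labelled are hypothetical names, and the data is the six-row sample used in the output above.

```r
library(dplyr)

# Six-row sample matching the output above.
df <- read.table(text = "Date Time X Y A B C
17-12-01 10:15 2 4 pr ab pr
17-12-01 10:45 3 5 pr ab pr
17-12-01 11:15 2 4 ab pr pr
17-12-01 11:40 2 4 ab pr pr
17-12-01 11:15 6 7 ab pr ab
17-12-01 11:45 3 4 pr ab pr", header = TRUE)

# Hypothetical vector of presence/absence column names
# (20 names in the real data).
ind_cols <- c("A", "B", "C")

labelled <- df %>%
  group_by(across(all_of(ind_cols))) %>%  # group on the listed columns
  mutate(id = cur_group_id()) %>%         # one integer per group
  ungroup()

print(labelled)
```

As with group_indices(), the ids are assigned in the sorted order of the group keys, so the first row does not necessarily get id 1.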
Upvotes: 1
Reputation: 19716
Here is a relatively simple extension of your own approach using dplyr:
tab %>%
  left_join(distinct(tab, A, B, C, .keep_all = TRUE) %>%
              mutate(unique = "1"), suffix = c(".x", ".y"))
You merge the full and the distinct data frames, having labelled the rows of the distinct one with a unique label.
#output:
Date Time X Y A B C unique
1 17-12-01 10:15 2 4 pr ab pr 1
2 17-12-01 10:45 3 5 pr ab pr <NA>
3 17-12-01 11:15 2 4 ab pr pr 1
4 17-12-01 11:40 2 4 ab pr pr <NA>
5 17-12-01 11:15 6 7 ab pr ab 1
One can add code to replace the NAs (replace_na is from tidyr; the replacement is a string so that it matches the character column):
.... %>%
  replace_na(list(unique = "0"))
where tab is:
tab = read.table(text = "Date Time X Y A B C
17-12-01 10:15 2 4 pr ab pr
17-12-01 10:45 3 5 pr ab pr
17-12-01 11:15 2 4 ab pr pr
17-12-01 11:40 2 4 ab pr pr
17-12-01 11:15 6 7 ab pr ab", header = T)
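The join above flags the first occurrence of each combination. If a numeric label per combination is wanted instead, as in the question's desired output, one possible variation on the same distinct-then-join idea is sketched below. The names combos and labelled are made up for illustration, and joining only on A, B, C is a choice made here, not part of the code above.

```r
library(dplyr)

tab <- read.table(text = "Date Time X Y A B C
17-12-01 10:15 2 4 pr ab pr
17-12-01 10:45 3 5 pr ab pr
17-12-01 11:15 2 4 ab pr pr
17-12-01 11:40 2 4 ab pr pr
17-12-01 11:15 6 7 ab pr ab", header = TRUE)

# Lookup table: one row per combination, numbered in order of first appearance.
combos <- tab %>%
  distinct(A, B, C) %>%
  mutate(Combination = row_number())

# Join the labels back on the presence/absence columns only,
# so every row (duplicates included) receives its combination number.
labelled <- left_join(tab, combos, by = c("A", "B", "C"))
print(labelled)
```

Keeping combos as a separate lookup table also gives a compact list of which combinations occur in the data.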
Here is a performance measure:
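library(dplyr)
library(tidyr)
library(microbenchmark)
microbenchmark(
  # join approach: flag the first occurrence of each combination
  tab %>%
    left_join(distinct(tab, A, B, C, .keep_all = TRUE) %>%
                mutate(unique = "1"), suffix = c(".x", ".y")) %>%
    replace_na(list(unique = "0")),
  # grouped approach: number rows within each combination
  tab %>%
    group_by(A, B, C) %>%
    mutate(id = row_number(),
           label = case_when(id >= 2 ~ "duplicate",
                             TRUE ~ "unique")))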
#output with 100000 rows: `tab <- tab[sample(1:5, 100000, replace = T),]`
min lq mean median uq max neval
12.099989 12.564350 13.43444 12.88050 13.44687 34.73189 100
9.374461 9.928966 11.62848 10.33991 11.57404 49.86249 100
#output with 1 million rows:
min lq mean median uq max neval
83.26594 85.65350 94.57422 86.24378 105.48224 254.9932 100
68.81993 88.88253 92.75600 91.22021 93.06835 255.9896 100
So the two approaches perform similarly.
Upvotes: 0