Nick

Reputation: 23

Identifying and labelling partial-duplicate records in R

I have a data frame with thousands (~12k) of records and multiple (36) variables. To provide some context, these are observations of animal behaviour. Twenty of the variables record the presence or absence of particular individuals (as factors with 2 levels).

What I want to achieve is the creation of an additional variable that labels each unique combination of the presence/absence variables. So, if I have the following simplified dataset, where X & Y record variables of interest (details are not relevant to the question) and three individuals A, B, C (where "ab" = absent and "pr" = present):

Date      Time    X    Y    A   B   C
17-12-01  10:15   2    4    pr  ab  pr
17-12-01  10:45   3    5    pr  ab  pr
17-12-01  11:15   2    4    ab  pr  pr
17-12-01  11:40   2    4    ab  pr  pr
17-12-01  11:15   6    7    ab  pr  ab
...

The possible unique combinations would be:

A    B    C
pr   pr   pr
pr   ab   ab
ab   pr   ab
ab   ab   pr
pr   pr   ab
ab   pr   pr
pr   ab   pr

[edit: the combination of all 'ab' is - in principle - possible, but would not appear as all individuals would be absent - and thus no observation could be made]

[edit:] desired output:

Date      Time    X    Y    A   B   C     Combination
17-12-01  10:15   2    4    pr  ab  pr    1
17-12-01  10:45   3    5    pr  ab  pr    1
17-12-01  11:15   2    4    ab  pr  pr    2
17-12-01  11:40   2    4    ab  pr  pr    2
...                         ab  ab  pr    3    etc.  

I have found that I can create a new data frame selecting the unique combinations using distinct from dplyr:

newDF = distinct(oldDF, A, B, C, .keep_all = TRUE)

but this deletes (or rather fails to include in newDF) the duplicates, whereas what I'd like to do is to keep all duplicates but just label each unique combination (all other posts and advice I can find are concerned with deleting duplicates), and the real data frame has 20 'individual' variables, not just three.

So my question is: is there a way to use the output from distinct to achieve this, or have I led myself down a blind alley and another approach is needed?

Any help much appreciated.

Nick

Upvotes: 2

Views: 1087

Answers (3)

Uwe

Reputation: 42544

The OP has requested to create an additional variable that labels each unique combination of the presence/absence variables. He has supplied a sample dataset with three different variables and a table of possible combinations of the values of the three variables.

To have distinct numbers for the 2^3 - 1 = 7 different combinations, we can use:

dt[, Combination := .GRP, by = .(A, B, C)][]

which numbers the combinations in order of appearance (.GRP is a special symbol in data.table syntax):

       Date  Time X Y  A  B  C Combination
1: 17-12-01 10:15 2 4 pr ab pr           1
2: 17-12-01 10:45 3 5 pr ab pr           1
3: 17-12-01 11:15 2 4 ab pr pr           2
4: 17-12-01 11:40 2 4 ab pr pr           2
5: 17-12-01 11:15 6 7 ab pr ab           3
6: 17-12-01 11:45 3 4 pr ab pr           1

Note that the sample dataset provided by the OP has been modified to contain a 6th row which has the same Combination number as rows 1 and 2.
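For the real data with 20 presence/absence columns, typing every column into by = .(...) gets tedious. Below is a minimal sketch, assuming the column names are collected in a character vector (ind_cols is a hypothetical name, not from the question), which data.table's by argument also accepts:

```r
library(data.table)

# Sample data from the question plus the extra sixth row used above.
dt <- data.table(
  Date = "17-12-01",
  Time = c("10:15", "10:45", "11:15", "11:40", "11:15", "11:45"),
  X = c(2, 3, 2, 2, 6, 3),
  Y = c(4, 5, 4, 4, 7, 4),
  A = c("pr", "pr", "ab", "ab", "ab", "pr"),
  B = c("ab", "ab", "pr", "pr", "pr", "ab"),
  C = c("pr", "pr", "pr", "pr", "ab", "pr")
)

# Hypothetical helper: names of the presence/absence columns.
# With the real data this would hold all 20 names.
ind_cols <- c("A", "B", "C")

# .GRP is the running group counter, so each combination is numbered
# in order of first appearance; := adds the column by reference.
dt[, Combination := .GRP, by = ind_cols]

print(dt)
```

Because := modifies dt in place, no reassignment is needed, and all rows (duplicates included) are kept.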

Upvotes: 2

tyluRp

Reputation: 4768

One approach with dplyr:

df %>% 
  mutate(id = group_indices(., A, B, C))

      Date  Time X Y  A  B  C id
1 17-12-01 10:15 2 4 pr ab pr  3
2 17-12-01 10:45 3 5 pr ab pr  3
3 17-12-01 11:15 2 4 ab pr pr  2
4 17-12-01 11:40 2 4 ab pr pr  2
5 17-12-01 11:15 6 7 ab pr ab  1
6 17-12-01 11:45 3 4 pr ab pr  3
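In dplyr 1.0.0 and later, group_indices() with column arguments is deprecated in favour of cur_group_id(). A rough equivalent, assuming a recent dplyr and a hypothetical character vector ind_cols holding the presence/absence column names, might look like this:

```r
library(dplyr)

# Only the presence/absence columns are needed for the illustration.
df <- data.frame(
  A = c("pr", "pr", "ab", "ab", "ab", "pr"),
  B = c("ab", "ab", "pr", "pr", "pr", "ab"),
  C = c("pr", "pr", "pr", "pr", "ab", "pr")
)

# Hypothetical helper: names of the presence/absence columns.
ind_cols <- c("A", "B", "C")

labelled <- df %>%
  group_by(across(all_of(ind_cols))) %>%  # group on every listed column
  mutate(id = cur_group_id()) %>%         # one integer per group
  ungroup()

print(labelled)
```

As in the output above, the ids follow the sorted order of the group keys rather than the order of first appearance.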

Upvotes: 1

missuse

Reputation: 19716

Here is a relatively simple extension of your own approach using dplyr:

tab %>%
  left_join(distinct(tab, A, B, C, .keep_all = TRUE) %>%
              mutate(unique = "1"), suffix = c(".x", ".y"))

This merges the full data frame with its distinct rows, which carry a label, so the first occurrence of each combination is marked and later duplicates get NA:

#output:
      Date  Time X Y  A  B  C unique
1 17-12-01 10:15 2 4 pr ab pr      1
2 17-12-01 10:45 3 5 pr ab pr   <NA>
3 17-12-01 11:15 2 4 ab pr pr      1
4 17-12-01 11:40 2 4 ab pr pr   <NA>
5 17-12-01 11:15 6 7 ab pr ab      1

One can add additional code to replace the NAs:

.... %>%
  replace_na(list(unique = "0"))  # character, to match the type of the "1" label

where tab is:

tab = read.table(text = "Date      Time    X    Y    A   B   C
17-12-01  10:15   2    4    pr  ab  pr
17-12-01  10:45   3    5    pr  ab  pr
17-12-01  11:15   2    4    ab  pr  pr
17-12-01  11:40   2    4    ab  pr  pr
17-12-01  11:15   6    7    ab  pr  ab", header = T)
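If the goal is a numeric label per combination, as in the question's desired output, rather than a flag on first occurrences, the same distinct-then-join idea can be adapted. This is only a sketch: keys and labelled are hypothetical names, and the join is restricted to A, B, C explicitly:

```r
library(dplyr)

tab <- read.table(text = "Date      Time    X    Y    A   B   C
17-12-01  10:15   2    4    pr  ab  pr
17-12-01  10:45   3    5    pr  ab  pr
17-12-01  11:15   2    4    ab  pr  pr
17-12-01  11:40   2    4    ab  pr  pr
17-12-01  11:15   6    7    ab  pr  ab", header = TRUE)

# One row per combination, numbered in order of first appearance.
keys <- tab %>%
  distinct(A, B, C) %>%
  mutate(Combination = row_number())

# Joining on the key columns only attaches the label to every row,
# duplicates included, without dropping anything.
labelled <- left_join(tab, keys, by = c("A", "B", "C"))

print(labelled)
```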

Here is a performance measure:

library(microbenchmark)
microbenchmark(tab %>%
                 left_join(distinct(tab, A, B, C, .keep_all = TRUE) %>%
                             mutate(unique = "1"), suffix = c(".x", ".y")) %>%
                 replace_na(list(unique = "0")),
tab %>%
  group_by(A, B, C) %>% 
  mutate(id = row_number(),
         label = case_when(id >= 2 ~ "duplicate",
                           TRUE ~ "unique")))

#output with 100000 rows: `tab <- tab[sample(1:5, 100000, replace = T),]`

     min        lq     mean   median       uq      max neval
 12.099989 12.564350 13.43444 12.88050 13.44687 34.73189   100
  9.374461  9.928966 11.62848 10.33991 11.57404 49.86249   100

#output with 1 million rows:

      min       lq     mean   median        uq      max neval
 83.26594 85.65350 94.57422 86.24378 105.48224 254.9932   100
 68.81993 88.88253 92.75600 91.22021  93.06835 255.9896   100

So the two approaches perform similarly.

Upvotes: 0
