maxxzi
maxxzi

Reputation: 67

Assign cumulative identifier to unique sets for each ID

I have a dataframe which contains two columns, id column id and an observation column val with observations from a-g in random order.

data.frame(
  id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
  val = c("a", "b", "c", "d", "e", "f", "a", "b", "b", "a", "c", "e", "d", "a", "b","c","f","g")
)

   id val
1   1   a
2   1   b
3   1   c
4   1   d
5   1   e
6   2   f
7   2   a
8   2   b
9   3   b
10  3   a
11  3   c
12  3   e
13  3   d
14  4   a
15  4   b
16  4   c
17  4   f
18  4   g

I want to assign an (cumulative) identifier valid for each unique set of observations (belonging to each id) given in column val, like so:

   id val valid
1   1   a     1
2   1   b     1
3   1   c     1
4   1   d     1
5   1   e     1
6   2   f     2
7   2   a     2
8   2   b     2
9   3   b     1
10  3   a     1
11  3   c     1
12  3   e     1
13  3   d     1
14  4   a     3
15  4   b     3
16  4   c     3
17  4   f     3
18  4   g     3

Upvotes: 4

Views: 93

Answers (2)

Maël
Maël

Reputation: 52319

In tidyverse.

  1. Create a list-column of sorted (for comparability) values by id
  2. Group by this new column, and use cur_group_id to create the valid column
  3. Unnest to get the desired output
library(dplyr)
library(tidyr)

df |> 
  reframe(val = list(sort(val)), .by = id) |> 
  mutate(valid = cur_group_id(), .by = val) |> 
  unnest(val)

# # A tibble: 18 × 3
#       id val   valid
#    <int> <chr> <int>
#  1     1 a         1
#  2     1 b         1
#  3     1 c         1
#  4     1 d         1
#  5     1 e         1
#  6     2 a         2
#  7     2 b         2
#  8     2 f         2
#  9     3 a         1
# 10     3 b         1
# 11     3 c         1
# 12     3 d         1
# 13     3 e         1
# 14     4 a         3
# 15     4 b         3
# 16     4 c         3
# 17     4 f         3
# 18     4 g         3

Upvotes: 3

ThomasIsCoding
ThomasIsCoding

Reputation: 102459

You can try the base R option like below

u <- with(df, by(val, id, sort))

dfout <- merge(
    df,
    setNames(
        stack(setNames(match(u, unique(u)), names(u))),
        c("valid", "id")
    ),
    all = TRUE
)

and you will obtain

> dfout
   id val valid
1   1   a     1
2   1   b     1
3   1   c     1
4   1   d     1
5   1   e     1
6   2   f     2
7   2   a     2
8   2   b     2
9   3   b     1
10  3   a     1
11  3   c     1
12  3   e     1
13  3   d     1
14  4   a     3
15  4   b     3
16  4   c     3
17  4   f     3
18  4   g     3

Upvotes: 4

Related Questions