Reputation: 457
Problem: I need to make a unique ID field for data that has two levels of grouping. In the example code here, it is Emp
and Color
. The ID needs to be structured as:
Emp
+ unique number of each Color
+ sequential number for duplicated Colors
.
These values are separated by periods.
Example data:
dat <- data.frame(Emp = c("A","A","A","B","B","C"),
Color = c("Red","Green","Green","Orange","Yellow","Brown"),
stringsAsFactors = FALSE)
The ID is supposed to appear as this:
ID <- c("A.01.001", "A.02.001", "A.02.002", "B.01.001", "B.02.001", "C.01.001")
ID [1] "A.01.001" "A.02.001" "A.02.002" "B.01.001" "B.02.001" "C.01.001"
The three character suffix to the ID to record the duplicates can be done as:
group_by(dat, Emp, Color) %>%
mutate(suffix = str_pad(row_number(), width=3, side="left", pad="0"))
But I am unable to assign sequential numbers to the unique occurrence of Color
with each Emp
group.
I prefer a dplyr solution, but any method would be appreciated.
Upvotes: 3
Views: 2699
Reputation: 83275
Using data.table
and sprintf
:
library(data.table)
setDT(dat)[, ID := sprintf('%s.%02d.%03d',
Emp, rleid(Color), rowid(rleid(Color))),
by = Emp]
you get:
> dat
Emp Color ID
1: A Red A.01.001
2: A Green A.02.001
3: A Green A.02.002
4: B Orange B.01.001
5: B Yellow B.02.001
6: C Brown C.01.001
How this works:
dat
to a data.table
with setDT()
Emp
.ID
-variable with the sprintf
-function. With sprintf
you paste several vector easily together according to a specified format.:=
means that the data.table
is updated by reference.%s
indicates that a string is to be used in the first part (which is Emp
). %02d
& %03d
indicates that a number needs to have two or three digits with a leading zero(s) when needed. The dots in between will taken literally and thus in cluded in the resulting string.Adressing the comment of @jsta, if the values in the Color
-column are not sequential you can use:
setDT(dat)[, r := as.integer(factor(Color, levels = unique(Color))), by = Emp
][, ID := sprintf('%s.%02d.%03d',
Emp, r, rowid(r)),
by = Emp][, r:= NULL]
This will also maintain the order in which the Color
column is presented. Instead of as.integer(factor(Color, levels = unique(Color)))
you can also use match(Color, unique(Color))
as shown by akrun.
Implementing the above on a bit larger dataset to illustrate:
dat2 <- rbindlist(list(dat,dat))
dat2[, r := match(Color, unique(Color)), by = Emp
][, ID := sprintf('%s.%02d.%03d',
Emp, r, rowid(r)),
by = Emp]
gets you:
> dat2
Emp Color r ID
1: A Red 1 A.01.001
2: A Green 2 A.02.001
3: A Green 2 A.02.002
4: B Orange 1 B.01.001
5: B Yellow 2 B.02.001
6: C Brown 1 C.01.001
7: A Red 1 A.01.002
8: A Green 2 A.02.003
9: A Green 2 A.02.004
10: B Orange 1 B.01.002
11: B Yellow 2 B.02.002
12: C Brown 1 C.01.002
Upvotes: 4
Reputation: 887981
We can try
dat %>%
group_by(Emp) %>%
mutate(temp = match(Color, unique(Color)),
temp2 = duplicated(Color)+1,
ID = sprintf("%s.%02d.%03d", Emp, temp, temp2))%>%
select(-temp, -temp2)
# Emp Color ID
# <chr> <chr> <chr>
#1 A Red A.01.001
#2 A Green A.02.001
#3 A Green A.02.002
#4 B Orange B.01.001
#5 B Yellow B.02.001
#6 C Brown C.01.001
Upvotes: 4