Reputation: 401
I would like to assign unique IDs to rows of a data table per multiple column values. Let's consider a simple example:
library(data.table)
DT = data.table(a=c(4,2,NA,2,NA), b=c("a","b","c","b","c"), c=1:5)
a b c
1: 4 a 1
2: 2 b 2
3: NA c 3
4: 2 b 4
5: NA c 5
I'd like to generate IDs based on columns a and b and expect to get three IDs where 2nd and 4th row IDs are identical and 3rd and 5th rows have the same ID as well.
I have seen two solutions but each are slightly incomplete:
1) Solution one requires sorted data table which is very cumbersome if we need to generate IDs per many columns (in my real application, IDs are generated based on about ten columns). Can we replace cumsum function so no sorting is required?
DT$ID1 <- cumsum(!duplicated(DT[,1:2]))
2) Solution two ignores NA values; while I'd like to include NAs and assign a group ID to them
DT <- transform(DT, ID2 = as.numeric(interaction(a,b, drop=TRUE)))
I appreciate any suggestion on how to modify either of the solutions to generate the Expected_ID shown below.
a b c ID1 ID2 Expected_ID
1: 4 a 1 1 1 1
2: 2 b 2 2 2 2
3: NA c 3 3 NA 3
4: 2 b 4 3 2 2
5: NA c 5 3 NA 3
Upvotes: 5
Views: 3766
Reputation: 66819
The idiomatic way:
DT[, g := .GRP, by=.(a,b)]
a b c g
1: 4 a 1 1
2: 2 b 2 2
3: NA c 3 3
4: 2 b 4 2
5: NA c 5 3
There's reason to believe this won't be fast, but it turns out it's not too bad compared to competing approaches:
nv = 10
nu = 3
nr = 1e6
library(data.table)
set.seed(1)
DT = do.call(CJ, rep(list(seq_len(nu)), nv))[sample(1:.N, nr, replace=TRUE)]
cols = copy(names(DT))
# "idiomatic" .GRP
system.time(DT[, g := .GRP, by=cols])
# user system elapsed
# 0.23 0.02 0.25
# sort and count runs
oi = as.call(lapply(c("order", cols), as.name))
system.time(DT[eval(oi), go := rleidv(.SD, cols)])
# user system elapsed
# 0.3 0.0 0.3
# paste 'em
system.time(DT[, gp := match(p <- do.call(paste, c(.SD, list(sep="_"))), unique(p)), .SDcols=cols])
# user system elapsed
# 5.26 0.06 5.32
# paste 'em, fact'em (@akrun's answer)
system.time(DT[, gpf := as.integer(factor(p <- do.call(paste, c(.SD, list(sep="_"))), levels = unique(p))), .SDcols=cols])
# user system elapsed
# 4.74 0.08 4.82
# check
identical(DT$g, DT$gp); identical(DT$g, DT$gpf)
uniqueN(DT, "g") == uniqueN(DT, c("g", "go"))
The rleidv way creates different group numbers, but effects the same grouping.
Increasing the size of the problem to nr = 5e7
raised the time to 8s for the .GRP
approach; 20s for the rleidv way; and led R to hang for the others on my system.
For anyone interested, more approaches can be found in the R FAQ How to create a consecutive index based on a grouping variable in a dataframe
Upvotes: 14
Reputation: 886948
We can use
DT[, Expected_ID := as.numeric(factor(paste(a, b), levels = unique(paste(a, b))))]
Upvotes: 0