Reputation: 1285
I have 100,000 individuals Using a combination of upper case letters, lower case letters and numbers, I want to create a five-character ID for each individual. I should not have any duplicates. How can I do this? I have tried the code below but I have 4 duplicates.
What is the number of possible unique combinations to create a 5 character ID with "letters"
, "LETTERS"
and "0:9"
?
set.seed(0)
mydata<-data.frame(
ID=rep(NA,10^5),
Poids=rnorm(n=10^5,mean = 65,sd=5)
)
for (i in 1:nrow(mydata)){
mydata$ID[i]<-c(
paste(sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),sep = "")
)
}
table(duplicated(mydata$ID))
FALSE TRUE
99996 4
Upvotes: 2
Views: 673
Reputation: 174278
(length(letters) + length(LETTERS) + length(0:9))^5
is 91,6132,832, so there is plenty of space to avoid clashes.
In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 91,6132,832 without replacement and interpret each number as its unique string of characters using a bit of modular math and indexing. This can all be done in a single pass:
space <- c(LETTERS, letters, 0:9)
set.seed(0)
samps <- sample(length(space)^5, 10^5)
m <- matrix("", nrow = 10^5, ncol = 5)
for(i in seq(ncol(m))) {
m[,i] <- space[(samps %% length(space)) + 1]
samps <- samps %/% length(space)
}
ID <- apply(m, 1, paste, collapse = "")
We can see this fulfils our requirements:
head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"
length(ID)
#> [1] 100000
length(unique(ID))
#> [1] 100000
The whole thing takes less than a second on my modest machine:
user system elapsed
0.72 0.00 0.74
Update
It occurs to me that it is possible to give 100,000 people a unique ID using only 16 characters, i.e. 0-9 and a-f, with code that is much quicker and simpler than above:
set.seed(0)
ID <- as.hexmode(sample(16^5, 10^5))
head(ID)
#> [1] "d43f9" "392a7" "033a2" "cf1d7" "aa10e" "134bb"
length(unique(ID))
#> [1] 100000
Which takes less than 10 milliseconds.
Created on 2022-05-15 by the reprex package (v2.0.1)
Upvotes: 4
Reputation: 145965
If you don't need randomness, the highly performant arrangements
package can help by iterating over the permutations in order, not generating any more than are needed:
library(arrangements)
x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)
ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MAR = 1, FUN = \(i) paste(x[i], collapse = ""))
rbind(head(mydata), tail(mydata))
# ID Poids
# 1 abcde 64.46278
# 2 abcdf 62.00053
# 3 abcdg 75.71787
# 4 abcdh 67.73765
# 5 abcdi 66.45402
# 6 abcdj 66.85561
# 99995 abFpe 56.20545
# 99996 abFpf 64.14443
# 99997 abFpg 70.70191
# 99998 abFph 66.83226
# 99999 abFpi 65.22835
# 100000 abFpj 56.28880
This is quite fast:
user system elapsed
0.194 0.001 0.203
Upvotes: 1
Reputation: 102241
You can try the code below (given N <- 1e5
and k <- 5
):
n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))),N)
where
n
gives a subset of the whole space that supports all unique combinations up to given number N
, e.g., N <- 100000
S
denotes a sub-space from which we draw the alphabets or digitsexpand.grid
gives all combinationsUpvotes: 2