Seydou GORO

Reputation: 1285

How to create a unique five-character identifier for each of 100,000 individuals?

I have 100,000 individuals. Using a combination of upper-case letters, lower-case letters and digits, I want to create a five-character ID for each individual, with no duplicates. How can I do this? I have tried the code below, but it produces 4 duplicates.

What is the number of possible unique combinations to create a 5 character ID with "letters", "LETTERS" and "0:9"?

set.seed(0)
    
    mydata<-data.frame(
      ID=rep(NA,10^5),
      Poids=rnorm(n=10^5,mean = 65,sd=5)
    )
    
    
    for (i in 1:nrow(mydata)){
      
      mydata$ID[i]<-c(
        paste(sample(c(0:9,LETTERS,letters),replace = F,size = 1),             
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),  
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),               
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),sep = "")
      )       
    }
    
    
    table(duplicated(mydata$ID))

FALSE  TRUE 
99996     4 

Upvotes: 2

Views: 673

Answers (3)

Allan Cameron

Reputation: 174278

(length(letters) + length(LETTERS) + length(0:9))^5 is 916,132,832, so there is plenty of space to avoid clashes.
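As a rough sanity check, the sketch below computes the size of the ID space and the expected number of colliding pairs when 100,000 IDs are drawn independently, as in the question's loop. The variable names are illustrative, and the pair count is the usual birthday-problem approximation rather than an exact count of duplicates.

```r
# 62 symbols: 26 upper-case letters, 26 lower-case letters, 10 digits
alphabet <- c(LETTERS, letters, 0:9)
n_people <- 1e5

# Number of distinct five-character strings over this alphabet
id_space <- length(alphabet)^5

# Expected number of pairs of individuals sharing an ID when every ID
# is drawn independently and uniformly (assumption: the question's loop)
expected_pairs <- choose(n_people, 2) / id_space

id_space
expected_pairs
```

Under these assumptions the expectation comes out at roughly five colliding pairs, so a handful of duplicates from independent draws is not surprising even though the space is large.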

In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 916,132,832 without replacement and interpret each number as its own unique string of characters, using a bit of modular arithmetic and indexing. This can all be done in a single pass:

space <- c(LETTERS, letters, 0:9)

set.seed(0)

samps <- sample(length(space)^5, 10^5)

m <- matrix("", nrow = 10^5, ncol = 5)

for(i in seq(ncol(m))) {
  m[,i] <- space[(samps %% length(space)) + 1]
  samps <- samps %/% length(space)
}

ID <- apply(m, 1, paste, collapse = "")

We can see this fulfils our requirements:

head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"

length(ID)
#> [1] 100000

length(unique(ID))
#> [1] 100000

The whole thing takes less than a second on my modest machine:

   user  system elapsed 
   0.72    0.00    0.74 
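To make the modular arithmetic concrete, here is a minimal sketch that decodes a single integer into its five-character string, using the same alphabet order as above. The helper name decode_id and its arguments are hypothetical and not part of the original answer.

```r
space <- c(LETTERS, letters, 0:9)

# Hypothetical helper: read x as a base-62 number and emit one character
# per digit, least significant digit first (same order as the loop above)
decode_id <- function(x, alphabet = space, width = 5) {
  base <- length(alphabet)
  chars <- character(width)
  for (i in seq_len(width)) {
    chars[i] <- alphabet[(x %% base) + 1]  # current base-62 digit
    x <- x %/% base                        # shift to the next digit
  }
  paste(chars, collapse = "")
}

decode_id(0)   # "AAAAA"
decode_id(1)   # "BAAAA"
decode_id(62)  # "ABAAA"
```

Because distinct integers below 62^5 have distinct base-62 digits, sampling the integers without replacement is enough to guarantee distinct strings.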

Update

It occurs to me that it is possible to give 100,000 people a unique ID using only 16 characters, i.e. 0-9 and a-f, with code that is much quicker and simpler than above:

set.seed(0)
ID <- as.hexmode(sample(16^5, 10^5))
head(ID)
#> [1] "d43f9" "392a7" "033a2" "cf1d7" "aa10e" "134bb"

length(unique(ID))
#> [1] 100000

This takes less than 10 milliseconds.
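One caveat: sample(16^5, 10^5) draws from 1 to 16^5, and 16^5 itself is 100000 in hexadecimal, which needs six digits. The variant below is a sketch, not the original code. It shifts the range to 0 through 16^5 - 1 and uses format() with an explicit width, so that every ID is a zero-padded character string of exactly five hex digits.

```r
set.seed(0)

# Draw 100,000 distinct integers from 0 .. 16^5 - 1 (all fit in 5 hex digits)
codes <- sample.int(16^5, 10^5) - 1L

# Convert to zero-padded, five-character hexadecimal strings
ID <- format(as.hexmode(codes), width = 5)

all(nchar(ID) == 5)
anyDuplicated(ID) == 0
```

This also returns a plain character vector rather than a hexmode object, which may be more convenient to store in a data frame column.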

Created on 2022-05-15 by the reprex package (v2.0.1)

Upvotes: 4

Gregor Thomas

Reputation: 145965

If you don't need randomness, the highly performant arrangements package can help by iterating over the permutations in order, not generating any more than are needed:

library(arrangements)

x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)

ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MARGIN = 1, FUN = \(i) paste(x[i], collapse = ""))

rbind(head(mydata), tail(mydata))
#           ID    Poids
# 1      abcde 64.46278
# 2      abcdf 62.00053
# 3      abcdg 75.71787
# 4      abcdh 67.73765
# 5      abcdi 66.45402
# 6      abcdj 66.85561
# 99995  abFpe 56.20545
# 99996  abFpf 64.14443
# 99997  abFpg 70.70191
# 99998  abFph 66.83226
# 99999  abFpi 65.22835
# 100000 abFpj 56.28880

This is quite fast:

  user  system elapsed 
  0.194   0.001   0.203 

Upvotes: 1

ThomasIsCoding

Reputation: 102241

You can try the code below (given N <- 1e5 and k <- 5):

n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))), N)

where

  • n is the smallest number of distinct characters whose k-character combinations number at least N, e.g., N <- 100000
  • S is a random subset of n characters (letters or digits) from which all IDs are built
  • expand.grid gives all k-character combinations of the characters in S
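The snippet above can be run end to end as follows. This is a self-contained sketch that defines N and k as stated, with a seed and uniqueness checks added for illustration only.

```r
set.seed(0)  # assumption: seed added only for reproducibility

N <- 1e5  # number of IDs required
k <- 5    # characters per ID

# Alphabet size needed so that n^k >= N
n <- ceiling(N^(1 / k))

# Random subset of n characters from the full 62-character space
S <- sample(c(LETTERS, letters, 0:9), n)

# All k-character combinations over S, truncated to the first N
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))), N)

length(ID)
anyDuplicated(ID) == 0
```

Note that the IDs produced this way use only a small part of the alphabet and come out in a regular order, so this approach suits cases where uniqueness matters more than randomness.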

Upvotes: 2
