alberto
alberto

Reputation: 2645

Create N random integers with no gaps

For a clustering algorithm that I'm implementing, I would like to initialize the clusters assignments at random. However, I need that there are no gaps. That is, this is not ok:

set.seed(2)
K <- 10 # initial number of clusters
N <- 20 # number of data points
z_init <- sample(K,N, replace=TRUE) # initial assignments
z_init
#  [1]  2  8  6  2 10 10  2  9  5  6  6  3  8  2  5  9 10  3  5  1
sort(unique(z_init))
# [1]  1  2  3  5  6  8  9 10

where labels 4 and 7 have not been used.

Instead, I would like this vector to be:

#  [1]  2  6  5  2 8 8  2  7  4  5  5  3  6  2  4  7 8  3  4  1

where the label 5 has become 4 and so forth to fill the lower empty labels.

More examples:

Can it be done avoiding for loops? I don't need it to be fast, I prefer it to be elegant and short, since I'm doing it only once in the code (for label initialization).

My solution using a for loop

z_init <- c(3,2,1,3,3,7,9)

idx <- order(z_init)
for (i in 2:length(z_init)){
  if(z_init[idx[i]] > z_init[idx[i-1]]){
    z_init[idx[i]] <- z_init[idx[i-1]]+1
  }
  else{
    z_init[idx[i]] <- z_init[idx[i-1]]  
  }

}

z_init
# 3 2 1 3 3 4 5

Upvotes: 5

Views: 142

Answers (4)

slamballais
slamballais

Reputation: 3235

Edit: @GregSnow came up with the current shortest answer. I'm 100% convinced that this is the shortest possible way.

For fun, I decided to golf the code, i.e. write it as short as possible:

z <- c(3, 8, 4, 4, 8, 2, 3, 9, 5, 1, 4)
# solution by hand: 1 2 3 3 4 4 4 5 6 6 7

sort(c(factor(z))) # 18 bits, as proposed by @GregSnow in the comments
# [1] 1 2 3 3 4 4 4 5 6 6 7

Some other (functioning) attempts:

y=table(z);rep(seq(y),y) # 24 bits
sort(unclass(factor(z))) # 24 bits, based on @GregSnow 's answer
diffinv(diff(sort(z))>0)+1 # 26 bits
sort(as.numeric(factor(z))) # 27 bits, @GregSnow 's original answer
rep(seq(unique(z)),table(z)) # 28 bits
cumsum(c(1,diff(sort(z))>0)) # 28 bits
y=rle(sort(z))$l;rep(seq(y),y) # 30 bits

Edit2: Just to show that bits isn't everything:

z <- sample(1:10,10000,replace=T)
Unit: microseconds
                                      expr      min        lq      mean    median        uq      max neval
                        sort(c(factor(z))) 2550.128 2572.2340 2681.4950 2646.6460 2729.7425 3140.288   100
   {     y = table(z)     rep(seq(y), y) } 2436.438 2485.3885 2580.9861 2556.4440 2618.4215 3070.812   100
                  sort(unclass(factor(z))) 2535.127 2578.9450 2654.7463 2623.9470 2708.6230 3167.922   100
            diffinv(diff(sort(z)) > 0) + 1  551.871  572.2000  628.6268  626.0845  666.3495  940.311   100
               sort(as.numeric(factor(z))) 2603.814 2672.3050 2762.2030 2717.5050 2790.7320 3558.336   100
             rep(seq(unique(z)), table(z)) 2541.049 2586.0505 2733.5200 2674.0815 2760.7305 5765.815   100
           cumsum(c(1, diff(sort(z)) > 0))  530.159  545.5545  602.1348  592.3325  632.0060  844.385   100
{  y = rle(sort(z))$l     rep(seq(y), y) }  661.218  684.3115  727.4502  724.1820  758.3280  857.412   100

z <- sample(1:100000,replace=T)
Unit: milliseconds
                                      expr       min        lq     mean    median       uq       max neval
                        sort(c(factor(z))) 84.501189 87.227377 92.13182 89.733291 94.16700 150.08327   100
   {     y = table(z)     rep(seq(y), y) } 78.951701 82.102845 85.54975 83.935108 87.70365 106.05766   100
                  sort(unclass(factor(z))) 84.958711 87.273366 90.84612 89.317415 91.85155 121.99082   100
            diffinv(diff(sort(z)) > 0) + 1  9.784041  9.963853 10.37807 10.090965 10.34381  17.26034   100
               sort(as.numeric(factor(z))) 85.917969 88.660145 93.42664 91.542263 95.53720 118.44512   100
             rep(seq(unique(z)), table(z)) 86.568528 88.300325 93.01369 90.577281 94.74137 118.03852   100
           cumsum(c(1, diff(sort(z)) > 0))  9.680615  9.834175 10.11518  9.963261 10.16735  14.40427   100
 { y = rle(sort(z))$l     rep(seq(y), y) } 12.842614 13.033085 14.73063 13.294019 13.66371 133.16243   100

Upvotes: 3

Greg Snow
Greg Snow

Reputation: 49640

A simple (but possibly inefficient) approach is to convert to a factor then back to numeric. Creating the factor will code the information as integers from 1 to the number of unique values, then add labels with the original values. Converting to numeric then drops the labels and leaves the numbers:

> x <- c(1,2,3,5,6,8)
> (x2 <- as.numeric(factor(x)))
[1] 1 2 3 4 5 6
> 
> xx <- c(15,5,7,7,10)
> (xx2 <- as.numeric(factor(xx)))
[1] 4 1 2 2 3
> (xx3 <- as.numeric(factor(xx, levels=unique(xx))))
[1] 1 2 3 3 4

The levels = portion in the last example sets the numbers to match the order in which they appear in the original vector.

Upvotes: 3

Julius Vainora
Julius Vainora

Reputation: 48211

You could do like this:

un <- sort(unique(z_init))
(z <- unname(setNames(1:length(un), un)[as.character(z_init)]))
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1
sort(unique(z))
# [1] 1 2 3 4 5 6 7 8

Here I replace elements of un in z_init with corresponding elements of 1:length(un).

Upvotes: 3

josliber
josliber

Reputation: 44330

It seems to me that you are trying to randomly assign elements of a set (the numbers 1 to 20) to clusters, subject to the requirement that each cluster be assigned at least one element.

One approach that I could think of would be to select a random reward r_ij for assigning element i to cluster j. Then I would define binary decision variables x_ij that indicate whether element i is assigned to cluster j. Finally, I would use mixed integer optimization to select the assignment from elements to clusters that maximizes the collected reward subject to the following conditions:

  • Every element is assigned to exactly one cluster
  • Every cluster has at least one element assigned to it

This is equivalent to randomly selecting an assignment, keeping it if all clusters have at least one element, and otherwise discarding it and trying again until you get a valid random assignment.

In terms of implementation, this is pretty easy to accomplish in R using the lpSolve package:

library(lpSolve)
N <- 20
K <- 10
set.seed(144)
r <- matrix(rnorm(N*K), N, K)
mod <- lp(direction = "max",
          objective.in = as.vector(r),
          const.mat = rbind(t(sapply(1:K, function(j) rep((1:K == j) * 1, each=N))),
                            t(sapply(1:N, function(i) rep((1:N == i) * 1, K)))),
          const.dir = c(rep(">=", K), rep("=", N)),
          const.rhs = rep(1, N+K),
          all.bin = TRUE)
(assignments <- apply(matrix(mod$solution, nrow=N), 1, function(x) which(x > 0.999)))
#  [1]  6  5  3  3  5  6  6  9  2  1  3  4  7  6 10  2 10  6  6  8
sort(unique(assignments))
#  [1]  1  2  3  4  5  6  7  8  9 10

Upvotes: 3

Related Questions