Fei

Reputation: 1498

How to generate random numbers from a given range with a given probability distribution

Suppose I have a list of files, each with a given weight (a larger number indicates a higher probability).

How can I generate a random sequence that follows these relative probabilities, much like the shuf tool does?

The sequence may be shorter than the number of files. This should be part of the input to a shell function, so a lightweight solution using traditional Unix tools is preferred; relying on heavy libraries or platforms (like Matlab) is not acceptable.

Upvotes: 1

Views: 38

Answers (2)

karakfa

Reputation: 67507

awk to the rescue!

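$ awk -v n=10 '{a[NR]=a[NR-1]+$2; v[NR]=$1}   # a[i]: cumulative weight, v[i]: file name
             END{srand();
                 for(j=1;j<=n;j++)
                    {r=int(rand()*a[NR])+1;    # uniform integer in 1..total weight
                     for(i=1;i<=NR;i++)        # first i whose cumulative weight covers r
                         if(r<=a[i])  {print v[i]; break}}}' weights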


$ cat weights
fileA 8
fileB 1
fileC 3
fileD 4

Usage: this draws 10 random samples (with replacement) according to the relative integer weights:

$ awk -v n=10 '...' weights
fileA
fileA
fileA
fileA
fileA
fileA
fileA
fileD
fileD
fileA
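
Since the question asks for something that can sit inside a shell function, the same cumulative-weight idea could be wrapped roughly as follows. This is only a sketch: the function name `weighted_pick` and the optional third seed argument are made up for illustration, and it assumes a "name weight" file with non-negative numeric weights and a positive total.

```shell
# weighted_pick N FILE [SEED]  (hypothetical helper, not part of the answer above)
# Prints N names from FILE ("name weight" per line), drawn with replacement
# in proportion to the weights.
weighted_pick() {
    awk -v n="$1" -v seed="$3" '
        { total += $2; cum[NR] = total; name[NR] = $1 }   # running cumulative weights
        END {
            if (total <= 0) exit 1                        # nothing to draw from
            if (seed != "") srand(seed); else srand()     # optional fixed seed (assumption)
            for (j = 1; j <= n; j++) {
                r = rand() * total                        # uniform in [0, total)
                for (i = 1; i <= NR; i++)
                    if (r < cum[i]) { print name[i]; break }
            }
        }' "$2"
}
```

Called as `weighted_pick 10 weights`, it behaves like the one-liner above. Without a seed, srand() seeds from the time of day, so two calls within the same second can return the same sequence; passing a distinct seed avoids that.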

Upvotes: 1

John1024

Reputation: 113914

To select a file randomly with relative probabilities given by:

$ cat file
fileA (8)
fileB (1)
fileC (3)
fileD (4)

Use this:

$ awk -F'[ ()]' '{for (i=1;i<=$(NF-1);i++) print $1}' file |shuf | head -n1
fileD
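
To draw several names instead of one, the same expansion trick could be combined with the `-r` (repeat) option of GNU shuf. A minimal sketch, assuming GNU coreutils and the "name (weight)" format shown above; the function name `weighted_shuf` is hypothetical:

```shell
# weighted_shuf N FILE  (hypothetical helper)
# Repeats each name as many times as its weight, then lets shuf draw
# N lines with replacement (-r) from the expanded list.
weighted_shuf() {
    awk -F'[ ()]+' '{ for (i = 1; i <= $2; i++) print $1 }' "$2" |
        shuf -r -n "$1"
}
```

For example, `weighted_shuf 10 file` prints 10 names. Dropping `-r` would instead sample lines of the expanded list without replacement, which is a different distribution.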

Upvotes: 1
