templatetypedef
templatetypedef

Reputation: 372664

Choosing a uniformly-random string with length at most n?

Suppose that I want to choose a uniformly-random string from all strings of length at most n (let's assume there's a fixed set of characters that the string can be made of, such as the letters A - Z). If I knew in advance what the length of the string was, I could easily choose a random string by choosing each character of that string uniformly at random. However, to guarantee that we pick strings uniformly at random, I can't just pick a uniformly random length and then pick a random string of that length, since if you were to pick a totally random string it would more often than not have a larger length than a shorter one, since there are more long strings than short ones.

Is there a known algorithm for choosing a string of length at most n uniformly at random?

Thanks!

Upvotes: 3

Views: 1497

Answers (6)

maxim1000
maxim1000

Reputation: 6365

Number of all strings up to length n is (26^(n+1)-1)/(26-1).

The idea is to decide if the string is empty. Probability of this is (26-1)/(26^(n+1)-1). To generate event of such probability, we generate 26^(n+1) events, ignore one of them and select 25 events from the rest.

char GenerateRandomCharacter()
{
    ...
}
std::string GenerateRandomStringOfFixedLength(int length)
{
    std::string result;
    for(int c=0;c<length;++c)
        result.push_back(c);
    return result;
}
bool WillWeGenerateEmptyString(int maxLength)
{
    while(true)
    {
        const std::string sample=GenerateRandomStringOfFixedLength(maxLength+1);
        if(sample==std::string(maxLength+1,'A'))
            continue;//this leaves 26^n-1 values
        else
            return sample.substr(1)==std::string(maxLength,'A');//only 25 strings satisfy this
    }
}
std::string Generate(int maxLength)
{
    if(WillWeGenerateEmptyString(maxLength))
        return std::string();
    else
    {
        std::string result;
        result.push_back(GenerateRandomCharacter());
        result+=Generate(maxLength-1);
        return result;
    }
}

Upvotes: 0

Keith
Keith

Reputation: 6834

Can't you just work out the distribution of lengths given the size of the character set?

Determine the ratio of length k strings to strings of length shorter than k. This follows from: wikipedia.

So, assume a maximal string and then randomly determine the relative chance of a shorter string.

If shorter repeat to see if n-1 or less.

I think this approach handles rounding errors reasonably cleanly. The chance of actually getting a very short strings when n is reasonable size is still very small, but representative.

To do the sums, we want:

k^n samples of length n
k^(n-1) of length n-1
etc.
k of length 1
1 of length 0

p(length < x)/p(length <= x)
= sum(1+..+k^x-1)/sum(1+..+k^x)
= (1 - k^-x)/ (k-k^-x)

So we can implement like this:

int getLength(int n, int setSize)
{
    if (n == 0)
        return 0;
    double oneOverKtoTheN = pow(1.0/setSize, (double)n);
    double pLengthN = (1-oneOverKtoTheN)/(setSize - oneOverKtoTheN);
    double sample = ((double) rand()) / RAND_MAX;
    if (sample < pLengthN)
        return n;
    return getLength(n-1, setSize);
}

Note how oneOverKtoTheN may be dropped due to floating point to start with, but as n decreases it starts to count as is should.

Upvotes: 0

David Eisenstat
David Eisenstat

Reputation: 65427

The distribution of n minus the length of a uniform random string is the same as X mod (n+1) where X is a geometric with range [0, infinity) and success probability 1-1/k and k is the number of letters in the alphabet. To choose a string exactly uniformly at random and without resorting to bignums: sample the geometric mod (n+1) (e.g., by sampling letters uniformly until one that is not A comes up, then returning the number of non-As generated mod (n+1)). Generate a string of length n minus that value.

Upvotes: 3

Penguino
Penguino

Reputation: 2176

An alternate method, if you require somewhat larger strings, is to build each string randomly from 27 characters, where the 27th char is an end-of-string character. You will have to reject any string that gets longer than your maximum acceptable n, but the generation should be fairly efficient. So to produce random strings with the 'correct' distribution and length in the range 0 to n you could use a more efficient version of the following:

Function RandomString(n : integer) : string;
var
  RandomChar : char;
begin
  result := '';
  repeat
    RandomChar := Char('a' + Random(27));
    if RandomChar in ['a'..'z'] then
      result := result + RandomChar;
    if Length(result) > n then
      result := '';
  until RandomChar not in ['a'..'z'];
end;

Upvotes: -1

Anil Vaitla
Anil Vaitla

Reputation: 2978

There are 26 characters, and the length is at most n. So the total number of strings is:

Total Number of Strings = \sum_{i=1}^n 26^i

We need each of these to be chosen with equal probability that is:

P(string s is chosen) = 1 / TotalNumStrings

Now consider the strategy you proposed of choosing a random length and then choosing a random string of that length. So by Bayes rule we have:

P(string s being chosen which has length i) =
     P(string s being chosen | string has length i) *
     P(we want a string of length i) = (1 / 26^i) * (1 / n) = 1 / (26^i * n)

which is not equal to 1 / TotalNumStrings. You already knew this wasn't going to work, however this motivates the correct selection strategy.

Now choose strings as follows:

P(string s being chosen which has length i) =
     P(string s being chosen | string has length i) *
     P(we want a string of length i) = 
         1 / (26^i) *  P(chosen string has length i) = 1 / NumStrings.

Thus we have P(chosen string has length i) = 26^i / NumStrings! Tada.

So to summarize the selection strategy is as follows. First choose the length i, with probability 26^i / NumStrings. Then in that category choose an arbitrary string.

Upvotes: 2

Joey
Joey

Reputation: 354356

Each letter makes adds another factor of the number of possible characters, so there are 26 one-letter strings, 26 × 26 two-letter strings, and so on. You just need to first pick a length at random by scaling accordingly.

E.g. you can pick a random number at most 308915776 and select the string length as follows:

< 26        - 1
< 702       - 2
< 15576     - 3
< 456976    - 4
< 11881376  - 5
< 308915776 - 6

The numbers get a little large quickly-ish, though, so that might work as long as your n is small. Otherwise you can use floating-point numbers and use ranges between 0 and 1 instead.

Upvotes: 2

Related Questions