Reputation: 126

Generate A Random String With A Set of Banned Substrings

I want to generate a random string of a fixed length L. However, there is a set of "banned" substrings all of length b that cannot appear in the string. Is there a way to algorithmically generate this parent string?

Here is a small example:

I want a string that is 10 characters long -> XXXXXXXXXX The banned substrings are {'AA', 'CC', 'AD'} The string ABCDEFGHIJ is a valid string, but AABCDEFGHI is not.

For a small example it is relatively easy to randomly generate and then check the string, but as the set of banned substrings gets larger (or the length of the banned substrings gets smaller), the probability of randomly generating a valid string rapidly decreases.

Upvotes: 3

Answers (3)

Dave

Reputation: 9065

Simple approach:

Create a map where the keys are the banned substrings excluding the final letter, and the values are the lists of allowed letters. Alternatively you can have the values be the lists of banned letters with slight modification.

Generate random letters, using the set of allowed letters for the (b-1)-length substring at the end of the letters already generated, or the full alphabet if that substring doesn't match a key.

This is O(n*b) unless there's a way to update the substring of the last b-1 characters in O(1).

Ruby Code

def randstr(strlen, banned_strings)
  b = banned_strings[0].length # All banned strings are length b

  alphabet = ('a'..'z').to_a

  prefix_to_legal_letters = Hash.new {|h, k| h[k] = ('a'..'z').to_a}
  
  banned_strings.each do |s|
    prefix_to_legal_letters[s[0..-2]].delete(s[-1])
  end
  
  str = ''
  
  while str.length < strlen
    letters_to_use = alphabet
    if str.length >= b-1
      str_end = str[(-b+1)..-1]
      if prefix_to_legal_letters.has_key?(str_end)
        letters_to_use = prefix_to_legal_letters[str_end]
      end
    end
    str += letters_to_use.sample()
  end
  
  return str
end

A silly example to show this works. The only legal letter after 'a' is 'z'. Every 'a' in the output is followed by a 'z'.

randstr(1000, ['aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao',
'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay'])

=> "gkowhxkhrknrxkbjxbjwiqohvvazwrjxjdekrujdyprjnmbjuklqsjdwzidhpgzzmnfbyjuptbpyezfeeydgdkpznvjfwziazrzohwvnitnfupdqxivtvkbazpvqzdzzsneslrazmhbjojyqowhvjhsrdgpbejicitprxzmkhgsuvvlyfizmhohorazemyhtbazvhvazdmnjmjzoggwmjjnrqxcmrdhxozbsjjdqgfjorazmtwtvvujpgivdxijowgxnkuxovncnivazmtykliqiielsfixuflfsgqbpevazozfsvfynhxyjpxtuolqooowazpyoukssknxdntzjjbqazxjttdblepsjzqgxmxvtrmjbgvuyfvspdrrohmtwhtdxfcvidswxtzbznszsqorpxdywbytsitxeziudmvlnluwmcqtfydxlocltozovhusbblfutbqjfjeslverzctxazyprazxzmazxwbdfkwxdwdqxnqhbcliwuitsnnpscbsjitoftblgjycpnxqsikpjqysmqiazdazwwjmeazxcbejthnlsskhazxazlrceyjtbmcpazscazvsjkqhiqfbjygjhyqazsbjymsovojfxynygzwmlhkmpvswpweqkkvmbrxhazpmiqrazcgprlbywmqpyvtphydniazovrkolzbslsosjvdqkgrjmcorqtgeazfwskjuhndszliiirtncmzrzhocyazyrhhpbcsmneuiktyswvgqwkzswkjnyuazggnreeccyidvrbxuskrlchjxnrrpljilogxmicjvmoeequbpkursrqsisqtfkruswnyftdgbjhwvcrlcnfecyfdnslmxztlbfxjhgeslqedrflthlhnlwopmsdjgochxwxhfhvqcixvxdjixcazggmexidtlhymkiyyfuhxufvxyfazmmwsbrlooqwfphgfhvthspvmyiazdazggpeuhnpjmzsazfxmsukpd"

Note that this code and approach needs modification if the OPs statement that all banned substrings are length b is modified -- either directly (by giving banned substrings of varying lengths), or implicitly (by having a (b-1)-length substring for which every letter is banned, in which case the (b-1)-length substring is effectively banned.

The modification here is the obvious one of checking all the possible key lengths in the map; it's still O(n*b) assuming b is the longest banned substring.

Upvotes: 0

btilly

Reputation: 46389

This will be a fairly efficient approach, but it requires a lot of theory.

First you can take your list of strings, like in this case AA, BB, AD. This can trivially be turned into a regular expression that matches any of them, namely /AA|BB|AD/. Which you can then turn into an Nondeterministic Finite Automaton (NFA) and a Deterministic Finite Automaton (DFA) for matching the regular expression. See these lecture notes for an example of how to do that. In this case the DFA will have the following states:

Matched nothing (yet)
Matched A
Matched B
End of match

And the transition rules will be:

If A go to state 2, if B go to state 3, else go to state 1.
If A or D go to state 4, else go to state 1.
If B go to state 4, else go to state 1.
Match complete, we're done. (Stay in state 4 forever.)

Now normally a DFA is used to find a match. We're going to use it as a way to find ways to avoid a match. So how will we do that?

The trick here is dynamic programming.

What we will do is create a table of:

by position in the string
    by state of the match
        how many ways there are to get here
        how many ways we got here from the previous (position, state) pairs

In other words we go forward and create a table which starts like this:

[
    {1: {'count': 1}},
    {1: {'count': 24, 'prev': {1: 24}},
     2: {'count': 1, 'prev': {1: 1}},
     3: {'count': 1, 'prev': {1: 1}},
    },
    {1: {'count': 625, 'prev': {1: 576, 2: 24, 3: 25}},
     2: {'count': 25, 'prev': {1: 24, 3: 1}},
     3: {'count': 25, 'prev': {1: 24, 2: 1}},
     4: {'count': 3, 'prev': {2: 2, 3: 1}},
    },
    ...         
]

By the time this is done, we know exactly how many ways we can wind up at the end of the string with a match (state 4), partial match (states 2 or 3) or not currently matching (state 1).

Now we generate the random string backwards. First we randomly pick the final state with odds based on the count. Then from that final state's prev entry we can pick the state we were on before that final one. Then we randomly pick a letter that would have done that. We are picking that letter/state combination completely randomly from all solutions.

Now we know what state we were in at the second to last letter. Pick the previous state, then pick a letter in the same way.

Continue back until finally we know the state we're in after we've picked the first letter, and then pick that letter.

It is a lot of logic to write, but it is all deterministic in time, and will let you pick random strings all day long once you've done the initial analysis.

Upvotes: 7

Melvin Lang

Reputation: 209

There are two ways.

Random String is created while repeating until the ban string does not appear.

#include <bits/stdc++.h>
using namespace std;
const int N = 1110000;

char ban[] = "AA";
char rnd[N];

int main() {
    srand(time(0));
    int n = 100;
    do {
        for (int i = 0; i < n; i++) rnd[i] = 'A'+rand()%26;
        rnd[n] = 0;
    } while (strstr(rnd, ban));
    cout << rnd << endl;
}

I think this is the easiest way to implement. However, this method has complexity up to O((26/25)^n*(n+b)), if the length of the string to be created is very long and the length of the ban string is very small. For example if ban="A", n=10000, then there will be time limit exceed!

You can proceed with the creation of one character and one character while checking whether there is a ban string.

If you want to use this way, you must know about KMP algorithm. In this way, we can not use the system default search function strstr.

#include <bits/stdc++.h>
using namespace std;
const int N = 1110000;

char ban[] = "A";
char rnd[N];

int b = strlen(ban);
int n = 10;

int *pre_kmp() {
    int *pie=new int [b];
    pie[0] = 0;
    int k=0;
    for(int i=1;i<b;i++)
    {
        while(k>0 && ban[k] != ban[i] )
        {
            k=pie[k-1];
        }
        if( ban[k] == ban[i] )
        {
            k=k+1;
        }
        pie[i]=k;
    }
    return pie;
}


int main() {
    srand(time(0));
    
    int *pie = pre_kmp();
    int matched_pos = 0;
    for (int cur = 0; cur < n; cur++) {
        do {
            rnd[cur] = 'A'+rand()%26;
            while (matched_pos > 0 && ban[matched_pos] != rnd[cur])
                matched_pos = pie[matched_pos-1];
            if (ban[matched_pos] == rnd[cur])
                matched_pos = matched_pos+1;
        } while (matched_pos == b);
    }
    
    cout << rnd << endl;
}

This algorithm's time complexity will be O(26 * n * b). Of course you could also search manually without using the KMP algorithm. However, in this case, the time complexity becomes O(n*b), so the time complexity becomes very large when both the ban string length and the generator string length are long.

Hope this helps you.

Upvotes: 2

Generate A Random String With A Set of Banned Substrings

Answers (3)

Related Questions