Reputation: 1279
I have a requirement where I have the alphabet 'ACGT' and I need to create a string of around 20,000 characters. This string should contain 100+ occurrences of the pattern "CCGT". Most of the time the generated string contains it around 20-30 instances.
int N = 20000;
std::string alphabet("ACGT");
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
str += alphabet[rand() % (alphabet.length())];
}
How do I tweak the code so that the pattern would appear more often?
Edit - Is there a way of changing the alphabet, i.e - 'A', 'C', 'G', 'T', 'CCGT' as characters of the alphabet?
Thank you.
Upvotes: 3
Views: 232
Reputation: 4763
You should make N
larger.
I take this liberty because you say 'create a string of around 20,000 characters'; but there's more to it than that.
If you're only finding around 20-30 instances in a string of 20000 characters then something is wrong. A ballpark estimate is to say that there are 20000 character positions to test, and at each of these there will be a four-letter string from an alphabet of four letters, giving a 1/256 chance of it being a specific string. The average should be (approximately; because I've oversimplified) 20000/256, or 78 hits.
It could be that your string isn't properly randomised (likely due to the use of the modulo idiom), or perhaps you're testing only every fourth character position -- as if the string were a list of non-overlapping four-letter words.
If you can bring your average hit rate back up to 78, then you can reach a little further to your 100 requirement simply by increasing N
proportionally.
Upvotes: 1
Reputation: 1837
I like the answer by @Andy Newman and think that is probably the best way - the code below is a compilable example of what they suggested.
#include <string>
#include <algorithm>
#include <iostream>
int main()
{
srand(time(0));
int N = 20000;
std::string alphabet("ACGT");
std::string str;
str.reserve(N);
int int_array[19700];
// Populate int array
for (int index = 0; index < 19700; index++)
{
if (index < 100)
{
int_array[index] = 0;
}
else
{
int_array[index] = (rand() % 4) + 1;
}
}
// Want the array in a random order
std::random_shuffle(&int_array[0], &int_array[19700]);
// Now populate string from the int array
for (int index = 0; index < 19700; index++)
{
switch(int_array[index])
{
case 0:
str += alphabet;
break;
case 1:
str += 'A';
break;
case 2:
str += 'C';
break;
case 3:
str += 'G';
break;
case 4:
str += 'T';
break;
default:
break;
}
}
// Print out to check what it looks like
std::cout << str << std::endl;
}
Upvotes: 1
Reputation: 48645
The solution I came up with uses a std::vector
to contain all the random sets of 4 chars including the 100 special sequence. I then shuffle that vector to distribute the 100 special sequence randomly throughout the string.
To make the distribution of letters even I create an alternative alphabet
string called weighted
that contains a relative abundance of alphabet
characters according to what has already been included from the 100 special sequence.
int main()
{
std::srand(std::time(0));
using std::size_t;
const size_t N = 20000;
std::string alphabet("ACGT");
// stuff the ballot
std::vector<std::string> v(100, "CCGT");
// build a properly weighted alphabet string
// to give each letter equal chance of appearing
// in the final string
std::string weighted;
// This could be scaled down to make the weighted string much smaller
for(size_t i = 0; i < (N - 200) / 4; ++i) // already have 200 Cs
weighted += "C";
for(size_t i = 0; i < (N - 100) / 4; ++i) // already have 100 Ns & Gs
weighted += "GT";
for(size_t i = 0; i < N / 4; ++i)
weighted += "A";
size_t remaining = N - (v.size() * 4);
// add the remaining characters to the weighted string
std::string s;
for(size_t i = 0; i < remaining; ++i)
s += weighted[std::rand() % weighted.size()];
// add the random "4 char" sequences to the vector so
// we can randomly distribute the pre-loaded special "4 char"
// sequence among them.
for(std::size_t i = 0; i < s.size(); i += 4)
v.push_back(s.substr(i, 4));
// distribute the "4 char" sequences randomly
std::random_shuffle(v.begin(), v.end());
// rebuild string s from vector
s.clear();
for(auto&& seq: v)
s += seq;
std::cout << s.size() << '\n'; // should be N
}
Upvotes: 1
Reputation: 1208
Generate an array of ints containing 100 x 0s and 490 1s, 2s, 3s and 4s [000000....111111....2222 etc] making almost 20,000 entries.
Then random shuffle it (std::random_shuffle)
Then write a string where each 0 translates to 'CCGT', each 1 translates to 'A', each 2 .... etc
I think that gives you what you want, and by tweaking the original array of ints you could change the number of 'A' characters in the output too.
Edit: If that isn't random enough, do 100 0s at the start and then random 1-4 for the rest.
Upvotes: 2
Reputation: 1837
My first thought would be to generate a list of 100 indexes where you will definitely insert the special string. Then as you generate the random string, insert the special string at each of these indexes as you reach them.
I've missed out checking that the intervals are spaced appropriately (cannot be within 4 of another interval) and also sorting them in ascending order - both of which would be necessary for this to work.
int N = 20000;
std::string alphabet("ACGT");
int intervals[100];
for (int index = 0; index < 100; index)
{
intervals[index] = rand() % 2000;
// Do some sort of check to make sure each element of intervals is not
// within 4 of another element and that no elements are repeated
}
// Sort the intervals array in ascending order
int current_interval_index = 0;
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
if (index == intervals[current_interval_index])
{
str += alphabet;
current_interval_index++;
index += 3;
}
else
{
str += alphabet[rand() % (alphabet.length())];
}
}
Upvotes: 1
Reputation: 71120
The only solution I can think of that would meet the "100+" criteria is:
create 20000 character string
number of instances (call it n) = 100 + some random value
for (i = 0 ; i < n ; ++i)
{
pick random start position
write CCGT
}
Of course, you'd need to ensure the overwritten characters weren't part of a "CCGT" already.
Upvotes: 1