Reputation: 5004
How can I take 1 million substring from a string with more than 3 million characters efficiently in C#? I have written a program which involves reading random DNA reads (substrings from random position) of length say 100 from a string with 3 million characters. There are 1 million such reads. Currently i run a while loop that runs 1 million times and read a substring of 100 character length from the string with 3 million character. This is taking a long time. What can i do to complete this faster?
heres my code, len is the length of the original string, 3 million in this case, it may be as low as 50 thats why the check in the while loop.
while(i < 1000000 && len-100> 0) //len is 3000000
{
int randomPos = _random.Next()%(len - ReadLength);
readString += all.Substring(randomPos, ReadLength) + Environment.NewLine;
i++;
}
Upvotes: 2
Views: 2902
Reputation: 4680
Using a StringBuilder to assemble the string will get you a 600 times increase in processing (as it avoids repeated object creation everytime you append to the string.
before loop (initialising capacity avoids recreating the backing array in StringBuilder):
StringBuilder sb = new StringBuilder(1000000 * ReadLength);
in loop:
sb.Append(all.Substring(randomPos, ReadLength) + Environment.NewLine);
after loop:
readString = sb.ToString();
Using a char array instead of a string to extract the values yeilds another 30% improvement as you avoid object creation incurred when calling Substring():
before loop:
char[] chars = all.ToCharArray();
in loop:
sb.Append(chars, randomPos, ReadLength);
sb.AppendLine();
Edit (final version which does not use StringBuilder and executes in 300ms):
char[] chars = all.ToCharArray();
var iterations = 1000000;
char[] results = new char[iterations * (ReadLength + 1)];
GetRandomStrings(len, iterations, ReadLength, chars, results, 0);
string s = new string(results);
private static void GetRandomStrings(int len, int iterations, int ReadLength, char[] chars, char[] result, int resultIndex)
{
Random random = new Random();
int i = 0, index = resultIndex;
while (i < iterations && len - 100 > 0) //len is 3000000
{
var i1 = len - ReadLength;
int randomPos = random.Next() % i1;
Array.Copy(chars, randomPos, result, index, ReadLength);
index += ReadLength;
result[index] = Environment.NewLine[0];
index++;
i++;
}
}
Upvotes: 2
Reputation: 14522
Edit: I abandoned the idea to use memcpy, and I think the result is super great. I've broken a 3m length string into 30k strings of 100 length each in 43 milliseconds.
private static unsafe string[] Scan(string hugeString, int subStringSize)
{
var results = new string[hugeString.Length / subStringSize];
var gcHandle = GCHandle.Alloc(hugeString, GCHandleType.Pinned);
var currAddress = (char*)gcHandle.AddrOfPinnedObject();
for (var i = 0; i < results.Length; i++)
{
results[i] = new string(currAddress, 0, subStringSize);
currAddress += subStringSize;
}
return results;
}
To use the method for the case shown in the question:
const int size = 3000000;
const int subSize = 100;
var stringBuilder = new StringBuilder(size);
var random = new Random();
for (var i = 0; i < size; i++)
{
stringBuilder.Append((char)random.Next(30, 80));
}
var hugeString = stringBuilder.ToString();
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 1000; i++)
{
var strings = Scan(hugeString, subSize);
}
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds / 1000); // 43
Upvotes: 0
Reputation: 2095
I think better solutions will come, but .NET StringBuilder class instances are faster than String class instances because it handles data as a Stream.
You can split the data in pieces and use .NET Task Parallel Library for Multithreading and Parallelism
Edit: Assign fixed values to a variable out of the loop to avoid recalculation;
int x = len-100
int y = len-ReadLength
use
StringBuilder readString= new StringBuilder(ReadLength * numberOfSubStrings);
readString.AppendLine(all.Substring(randomPos, ReadLength));
for Parallelism you should split your input to pieces. Then run these operations on pieces in seperate threads. Then combine the results.
Important: As my previous experiences showed these operations run faster with .NET v2.0 rather than v4.0, so you should change your projects target framework version; but you can't use Task Parallel Library with .NET v2.0 so you should use multithreading in oldschool way like
Thread newThread ......
Upvotes: 1
Reputation: 18815
How long is a long time ? It shouldn't be that long.
var file = new StreamReader(@"E:\Temp\temp.txt");
var s = file.ReadToEnd();
var r = new Random();
var sw = new Stopwatch();
sw.Start();
var range = Enumerable.Range(0,1000000);
var results = range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToList();
sw.Stop();
sw.ElapsedMilliseconds.Dump();
s.Length.Dump();
So on my machine the results were 807ms and the string is 4,055,442 chars.
Edit: I just noticed that you want a string as a result, so my above solution just changes to...
var results = string.Join(Environment.NewLine,range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToArray());
And adds about 100ms, so still under a second in total.
Upvotes: 0