Reputation: 21
I am very new to C++! Therefore, I would really appreciate it if you would consider that and keep the answer as simple as possible. I need to parse a FASTA file with >40000 sequences (nearly 500 MB) and write the ID and sequence length into a new file. I found that this runs very slowly in C++, while Python does the same job much faster. But I need to learn how to do it in C++. I wonder: are there any ways to speed this up in C++?
This is my code:
#include <iostream>
#include <fstream>
#include <string>
#include <time.h>
#include <stdio.h>

using namespace std;

int main() {
    time_t start, end;
    time(&start);
    clock_t begin = clock();

    ifstream file;
    string line;
    string id;
    string content;
    int len = 0;
    int i = 0;

    ofstream out;
    file.open("contigs.fasta", ios::in);
    out.open("output.txt", ios::out);

    while (getline(file, line)) {
        if (line[0] == '>') {
            i++;
            if (i != 1) {
                //cout << id << "\n" << len << "\n" << content << endl;
                //out.write(line.c_str(), line.size());
                out << id << " : " << len << endl;
            }
            id = line;
            len = 0;
            content = "";
        }
        else {
            len += line.length();
            content += line;
        }
    }
    //cout << id << "\n" << len << "\n" << content << endl;
    //out << id << " : " << len << endl;

    cout << "Total number of sequences : " << i << "\n";
    out.close();

    time(&end);
    double dif = difftime(end, start);
    printf("Elapsed time is %.2lf seconds.", dif);
    return 0;
}
Thanks in advance!
Upvotes: 0
Views: 2486
Reputation: 73637
Why is it slow ?
A FASTA file can be quite big, but that is in no way an issue for C++. The best way to know for sure would be to use a profiler.
That said, string allocation is a very good candidate root cause here: every line read is appended to the end of content, causing the string to grow. This means frequent reallocation as content grows, which causes far more allocation, copying and deallocation of memory than needed!
Such an approach can also cause heap fragmentation and considerably slow the process down when done hundreds of thousands of times. Fortunately, there are several strategies to do this faster.
How to speed it easily ?
You can use reserve() to preallocate enough space for content. This can be an easy accelerator, especially if you know the average size of your sequences. But even if you don't, it can eliminate a lot of the reallocation effort.
Just try this to observe if there's a difference:
content.reserve (100000); // just before entering into the loop.
How to speed it further ?
Another approach which can be very effective is to determine the size of your FASTA file with seekg() and tellg(), then load the file into memory in a single read with fread(), and parse/process it directly where you've read it.
With this very raw approach you should obtain throughput in the Gb/s range.
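A minimal sketch of the single-read idea (using ifstream::read rather than fread(), with a hypothetical load_whole_file helper; it assumes the whole file fits in RAM) could look like this:

    // Sketch: determine the file size with seekg()/tellg(), then read the
    // whole file into memory in one go and parse the buffer afterwards.
    #include <fstream>
    #include <string>

    std::string load_whole_file(const char* filename) {
        std::ifstream file(filename, std::ios::binary);
        file.seekg(0, std::ios::end);
        std::streamsize size = file.tellg();   // position at the end == file size
        file.seekg(0, std::ios::beg);

        std::string buffer(static_cast<std::size_t>(size), '\0');
        file.read(&buffer[0], size);           // one single large read
        return buffer;                         // parse this buffer in memory afterwards
    }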
Last but not least, don't forget to compile your C++ code in release mode (optimizer on) for performance measurements.
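With GCC or Clang, for example, that just means turning the optimizer on (the file name here is only an assumption):

    g++ -O2 -o parse_fasta parse_fasta.cpp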
Upvotes: 1
Reputation: 180303
You're using out << ... << endl, which flushes every single line directly to disk. Since disks aren't character-oriented, that means a read-modify-write operation for each line.
Instead, use out << '\n' to write just the newline. The disk cache will handle the rest.
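Applied to the loop in the question, the write would become:

    out << id << " : " << len << '\n';   // no explicit flush; the stream and OS buffer the writes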
Upvotes: 1
Reputation: 75
Maybe you should read the whole file, or a block of it, into a preallocated string, and then use a std::stringstream to process it as needed. Here is an example of what I use in my programs. My files are not as big, but they contain thousands of lines, each of which is then parsed for specific characters, copied, etc., and this takes only a few ms (around 50 ms for the biggest files, loading and parsing included).
//1- read the file
//(needs <fstream>, <sstream> and <string>)
std::string str; // allocate string
{
    //compute file size
    int iFileSize = 0;
    {
        std::ifstream ifstr(rkFilename.c_str(), std::ios::binary); // create the file stream - this is scoped for destruction
        if (!ifstr.good())
        {
            return;
        }
        //get the file size
        iFileSize = ifstr.tellg();
        ifstr.seekg(0, std::ios::end); // seek to the end to get the size
        iFileSize = (int)ifstr.tellg() - iFileSize;
    }

    //reopen the file for reading this time
    std::ifstream ifstr(rkFilename.c_str());
    //create a char* with the right size
    char* pcFileBuffer = new char[iFileSize];
    //copy the full file in there
    ifstr.read(pcFileBuffer, iFileSize);
    //put it all into a string - use the number of characters actually read,
    //since the buffer is not null-terminated
    str.assign(pcFileBuffer, ifstr.gcount());
    //bookkeeping
    delete[] pcFileBuffer;
    pcFileBuffer = NULL;
}

// create a stream using the allocated string
// this stream works as a file reader basically so you can extract lines into strings, etc...
std::stringstream filebuf(str);

//the rest is up to you
Adapt this to read chunks if you don't have enough memory to hold a full 500 MB file...
One more optimisation you could do: as @Adrian said, the content += line concatenation is pretty slow. Looking at your code, you may want to search for the '>' character while saving start and stop indexes, without copying the data. You would then allocate the memory only once and work with the start and stop indexes you found (or just build a data structure of start and stop indexes :-)). That's what I use to parse my files. I make use of std::string's find_first_of, find_first_not_of, find_last_of and substr methods. While these are probably suboptimal, they keep the code readable and are fast enough for my purpose.
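As a rough illustration of the index-based idea (not the exact code I use), assuming str already holds the whole file as above and out is the output stream from the question:

    // Walk the in-memory buffer with find(), recording record boundaries
    // instead of concatenating the sequence lines.
    std::size_t pos = 0;
    while ((pos = str.find('>', pos)) != std::string::npos)
    {
        std::size_t idEnd = str.find('\n', pos);          // end of the ">" header line
        if (idEnd == std::string::npos) idEnd = str.size();

        std::size_t next = str.find('>', idEnd);          // start of the next record
        if (next == std::string::npos) next = str.size();

        std::string id = str.substr(pos, idEnd - pos);    // header only, the sequence is never copied
        // sequence length = characters between header and next record, minus newlines
        std::size_t len = 0;
        for (std::size_t i = idEnd; i < next; ++i)
            if (str[i] != '\n' && str[i] != '\r') ++len;

        out << id << " : " << len << '\n';
        pos = next;
    }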
I hope my answer gives you a hint at what to do and helps you speed up your program.
Also, it is a good idea to use a profiler to determine what is taking the most time. One is built into Visual Studio 2015, for instance.
Best regards
Upvotes: 2