tov_Kirov

Reputation: 21

C++ read/write big files

I am very new to C++, so I would really appreciate it if you could keep the answer as simple as possible. I need to parse a FASTA file with >40000 sequences (around 500 MB) and write the ID and sequence length of each record into a new file. I found that this runs very slowly in C++, while Python does the same job much faster. But I need to learn how to do it in C++. I wonder: are there any ways to speed this process up in C++?

This is my code:

#include <iostream>
#include <fstream>
#include <string>
#include <time.h>
#include <stdio.h>

using namespace std;
int main() {
    time_t start, end;
    time(&start);
    clock_t begin = clock();
    ifstream file;
    string line;
    string id;
    string content;
    int len = 0;
    int i = 0;
    ofstream out;

    file.open("contigs.fasta", ios::in);
    out.open("output.txt", ios::out);
    while (getline(file, line)) {
        if (line[0] == '>') {
            i++;
            if (i != 1) {
                //cout << id << "\n" << len << "\n" << content << endl;
                //out.write(line.c_str(), line.size());
                out << id << " : " << len << endl;
            }
            id = line;
            len = 0;
            content = "";
        }
        else
        {
            len += line.length();
            content += line;
        }
    }
    //cout << id << "\n" << len << "\n" << content << endl;
    //out << id << " : " << len << endl;
    cout << "Total number of sequences :" << i << "\n";
    out.close();
    time(&end);
    double dif = difftime(end, start);
    printf("Elapsed time is %.2lf seconds.", dif);
    return 0;
}

Thanks in advance!

Upvotes: 0

Views: 2486

Answers (3)

Christophe

Reputation: 73637

Why is it slow?

A fasta file can be quite big. But that's in no way an issue in C++. The best way to know would be to use a profiler.

But here, string allocation is a very good candidate root cause: every line read is appended to the end of the string, causing the string to grow. This means frequent reallocation as content grows, which causes allocation, copying and deallocation of memory, much more than needed!

Such an approach can also cause heap fragmentation and slow the process down considerably when done hundreds of thousands of times. Fortunately, there are several strategies to do this faster.

How to speed it up easily?

You can use reserve() to preallocate enough space for content. This can be an easy accelerator, especially if you know the average size of your sequences. But even if you don't, it can greatly reduce the reallocation effort.

Just try this to observe if there's a difference:

    content.reserve (100000);   // just before entering into the loop.   

How to speed it up further?

Another approach, which can be very effective as well, is to determine the size of your fasta file with seekg() and tellg(), then load the file into memory in a single read with fread(), and parse/process it directly where you've read it.

With this very raw approach you should obtain throughput in the Gb/s range.
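For illustration, here is a minimal sketch of that idea, assuming the file and output names from the question, Unix line endings, and istream::read() instead of fread() to stay with the iostreams already in use. It reads the whole file into one string in a single call, then walks it line by line without ever building a content string:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // determine the file size with seekg()/tellg()
    std::ifstream file("contigs.fasta", std::ios::binary);
    if (!file) return 1;
    file.seekg(0, std::ios::end);
    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    // load the whole file in one read
    std::string buffer(static_cast<std::size_t>(size), '\0');
    file.read(&buffer[0], size);

    std::ofstream out("output.txt");

    // parse in place: only lengths are accumulated, sequence data is never copied
    std::string id;
    std::size_t len = 0, count = 0, pos = 0;
    while (pos < buffer.size()) {
        std::size_t eol = buffer.find('\n', pos);
        if (eol == std::string::npos) eol = buffer.size();
        if (buffer[pos] == '>') {
            if (count++ != 0) out << id << " : " << len << '\n';
            id.assign(buffer, pos, eol - pos);
            len = 0;
        } else {
            len += eol - pos;    // add this line's bases, no concatenation
        }
        pos = eol + 1;
    }
    if (count != 0) out << id << " : " << len << '\n';    // don't forget the last record

    std::cout << "Total number of sequences : " << count << "\n";
    return 0;
}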

Last but not least, don't forget to compile your C++ code in release mode (optimizer on) for performance measurements.

Upvotes: 1

MSalters

Reputation: 180303

You're using out << ... << endl. That flushes each single line directly to disk. Since disks aren't character-oriented, it means a read-modify-write operation.

Instead, use out << '\n' to write just a newline. The disk cache will handle this.
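Applied to the code in the question, that is a one-character change (shown here as a sketch, not measured):

out << id << " : " << len << '\n';   // '\n' just appends a newline to the buffer
// out << id << " : " << len << endl;   // endl also flushes the stream every time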

Upvotes: 1

DerickThePoney

Reputation: 75

Maybe you should read the whole file, or a block of it, into a preallocated string, and then use a std::stringstream to process the file as needed. Here is an example of what I use in my programs. My files are not as big, but they contain thousands of lines, each of which is then parsed for specific characters, copied, etc., and this takes only a few ms (around 50 ms for the biggest files, loading and parsing included).

//1- read the file
// (requires <fstream>, <sstream> and <string>; rkFilename is a std::string holding the path)
std::string str; // allocate string
{
    //compute file size
    int iFileSize = 0;
    {
        std::ifstream ifstr(rkFilename.c_str(), std::ios::binary); // create the file stream - this is scoped for destruction

        if(!ifstr.good())
        {
            return;
        }

        //get the file size
        iFileSize = ifstr.tellg();
        ifstr.seekg( 0, std::ios::end ); // seek to the end of the file to get its size
        iFileSize = (int) ifstr.tellg() - iFileSize;
    }

    //reopen the file for reading this time
    std::ifstream ifstr(rkFilename.c_str());

    //create a char* with the right size
    char* pcFileBuffer = new char[iFileSize];

    //copy the full file in there
    ifstr.read(pcFileBuffer, iFileSize);

    //put it all into a string - pass the number of characters actually read,
    //since the buffer is not null-terminated
    str.assign(pcFileBuffer, ifstr.gcount());

    //bookkeeping
    delete[] pcFileBuffer;
    pcFileBuffer = NULL;
}

// create a stream using the allocated string
// this stream works as a file reader basically so you can extract lines into string, etc...
std::stringstream filebuf(str);

//the rest is up to you

Adapt this to read in chunks if you don't have enough memory to read a full 500 MB file at once...

One more optimisation you could do: as @Adrian said, the content += line is pretty slow. Looking at your code, you may want to look for the '>' character while saving start and stop indexes, without copying any data. You would then only allocate the memory once and move data around using the start and stop indexes you found (or just build a data structure of start and stop indexes :-)). That's what I use to parse my files, relying on std::string's find_first_of, find_first_not_of, find_last_of and substr methods. While these are probably suboptimal, they keep the code readable and are fast enough for my purpose. A sketch of the index-based idea follows below.
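Here is a rough sketch of that index-based scan, assuming the whole file is already loaded into str as above and that '>' only appears at the start of header lines; the function name writeLengths and the output format are only for illustration:

#include <fstream>
#include <string>

// scan the buffer with indexes only; just the header line is copied out with substr()
void writeLengths(const std::string& str, std::ofstream& out)
{
    std::string::size_type pos = 0;
    while ((pos = str.find('>', pos)) != std::string::npos)
    {
        // the ID runs from '>' to the end of the line
        std::string::size_type idEnd = str.find('\n', pos);
        if (idEnd == std::string::npos) break;
        std::string id = str.substr(pos, idEnd - pos);

        // the sequence runs from the next line to the next '>' (or the end of the file)
        std::string::size_type seqStart = idEnd + 1;
        std::string::size_type seqEnd = str.find('>', seqStart);
        if (seqEnd == std::string::npos) seqEnd = str.size();

        // count the bases between the two indexes, skipping newlines
        std::string::size_type len = 0;
        for (std::string::size_type i = seqStart; i < seqEnd; ++i)
            if (str[i] != '\n' && str[i] != '\r') ++len;

        out << id << " : " << len << '\n';
        pos = seqEnd;
    }
}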

I hope my answer gives you a hint at what to do and helps you speed up your program.

Also, it is a good idea to use a profiler to determine what's taking the most time. It's built into Visual Studio 2015, for instance.

Best regards

Upvotes: 2
