Garovann

Reputation: 43

Fast counting of nucleotide types in a large number of sequences

First, a bit of background about my question.
I work as a bioinformatician, which means I use computational methods to try to answer biological questions. For this problem, I have to manipulate a file format called FASTA, which looks like this:

>Header 1  
ATGACTGATCGNTGACTGACTGTAGCTAGC  
>Header 2  
ATGCATGCTAGCTGACTGATCGTAGCTAGC  
ATCGATCGTAGCT

So a FASTA file is basically just a header line, prefixed with a '>' character, followed by a sequence on one or more lines. The sequence is composed of nucleotides, characters that can take 5 possible values: A, T, C, G or N.

The thing I would like to do is count the number of times each nucleotide type appears. So if we consider this dummy FASTA file:

>Header 1  
ATTCGN

I should have, as a result :
A:1 T:2 C:1 G:1 N:1

Here is what I have so far:

ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        nucleotide_counts['A'] = boost::count(sequence, 'A');
        nucleotide_counts['T'] = boost::count(sequence, 'T');
        nucleotide_counts['C'] = boost::count(sequence, 'C');
        nucleotide_counts['G'] = boost::count(sequence, 'G');
        nucleotide_counts['N'] = boost::count(sequence, 'N');
        sequence = "";
    }
}

So it reads the file line by line; when it encounters a '>' as the first character of a line, it knows the previous sequence is complete and counts it. The problem I'm facing is that I have millions of sequences, with several billion nucleotides in total. I can see that my method is not optimal because I call boost::count five times on the same sequence.

Other things I have tried :

I searched the internet for solutions but every solution I found was good if the number of sequences was low, which is not my case. Would you have any idea that could help me speed things up ?

EDIT 1: I also tried this version, but it was twice as slow as the boost one:

ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        for(int i = 0; i < sequence.size(); i++) {
           nucleotide_counts[sequence[i]]++;
        }
        sequence = "";
    }
}

EDIT 2: Thanks to everyone in this thread, I was able to obtain a speed-up of about 30 times compared to the original boost solution. Here is the code:

#include <array> // std::array
#include <fstream> // std::ifstream
#include <string> // std::string

void count_nucleotides(std::array<long long, 26> &nucleotide_counts, const std::string &sequence) {
    for(unsigned int i = 0; i < sequence.size(); i++) {
        ++nucleotide_counts[sequence[i] - 'A'];
    }
}

std::ifstream sequence_file(input_file.c_str());
std::string line;
std::string sequence = "";
std::array<long long, 26> nucleotide_counts{}; // zero-initialised

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        count_nucleotides(nucleotide_counts, sequence);
        sequence = "";
    }
}
count_nucleotides(nucleotide_counts, sequence); // don't forget the last sequence

Upvotes: 2

Views: 1041

Answers (5)

kvantour

Reputation: 26481

If this is the main task you have to perform, an awk solution might interest you. Many problems with FASTA files are easily tackled with awk:

awk '/^>/ && c { for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" ; delete a }
    /^>/ {print; c++; next}
    { for(i=1;i<=length($0);++i) a[substr($0,i,1)]++ }
    END{ for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" }' fastafile

This outputs on your example:

>Header 1  
N:1 A:7 C:6 G:8 T:8 
>Header 2  
A:10 C:10 G:11 T:12 

Note: I am aware that this is not C++, but it is often useful to show other means of achieving the same goal.


Benchmarks with awk:

Script 0: (runtime: too long) The first script above is very slow. Use it only on small files.

Script 1: (runtime: 484.31 sec) This is an optimised version where we do a targeted count:

/^>/ && f { for(i in c) printf i":"c[i]" "; print "" ; delete c }
/^>/ {print; f++; next}
{   s=$0
    c["A"]+=gsub(/[aA]/,"",s)
    c["C"]+=gsub(/[cC]/,"",s)
    c["G"]+=gsub(/[gG]/,"",s)
    c["T"]+=gsub(/[tT]/,"",s)
    c["N"]+=gsub(/[nN]/,"",s)
}
END { for(i in c) printf i":"c[i]" "; print "" ; delete c }

Update 2: (runtime: 416.43 sec) Combine all the subsequence lines into a single sequence and count only once:

function count() {
    c["A"]+=gsub(/[aA]/,"",string)
    c["C"]+=gsub(/[cC]/,"",string)
    c["G"]+=gsub(/[gG]/,"",string)
    c["T"]+=gsub(/[tT]/,"",string)
    c["N"]+=gsub(/[nN]/,"",string)
}
/^>/ && f { count(); for(i in c) printf i":"c[i]" "; print "" ; delete c; string=""}
/^>/ {print; f++; next}
{ string=string $0 }
END { count(); for(i in c) printf i":"c[i]" "; print "" }

Update 3: (runtime: 396.12 sec) Change how awk splits records and fields, and process each sequence in a single go:

function count() {
    c["A"]+=gsub(/[aA]/,"",string)
    c["C"]+=gsub(/[cC]/,"",string)
    c["G"]+=gsub(/[gG]/,"",string)
    c["T"]+=gsub(/[tT]/,"",string)
    c["N"]+=gsub(/[nN]/,"",string)
}
BEGIN{RS="\n>"; FS="\n"}
{
  print $1
  string=substr($0,length($1)+1); count()
  for(i in c) printf i":"c[i]" "; print ""
  delete c; string=""
}

Update 4: (runtime: 259.69 sec) Change the regex used in gsub to match whole runs; this gives a worthwhile speedup:

function count() {
    n=length(string);
    gsub(/[aA]+/,"",string); m=length(string); c["A"]+=n-m; n=m
    gsub(/[cC]+/,"",string); m=length(string); c["C"]+=n-m; n=m
    gsub(/[gG]+/,"",string); m=length(string); c["G"]+=n-m; n=m
    gsub(/[tT]+/,"",string); m=length(string); c["T"]+=n-m; n=m
    gsub(/[nN]+/,"",string); m=length(string); c["N"]+=n-m; n=m
}
BEGIN{RS="\n>"; FS="\n"}
{
  print ">"$1
  string=substr($0,length($1)+1); count()
  for(i in c) printf i":"c[i]" "; print ""
  delete c; string=""
}

Upvotes: 2

Max Langhof

Reputation: 23701

In order of importance:

  1. Good code for this task will 100% be I/O-bound. Your processor can count characters much faster than your disk can pump them to the CPU. Thus, the first question to me is: What is the throughput of your storage medium? What are your ideal RAM and cache throughputs? Those are the upper limits. If you've hit them, there's not much point in looking at your code further. It's possible that your boost solution is there already.

  2. std::map lookups are relatively expensive. Yes, it's O(log(N)), but your N=5 is small and constant, so this tells you nothing. For 5 values, the map will have to chase about three pointers for every lookup (not to mention how impossible this is for the branch predictor). Your count solution has 5 map lookups and 5 traversals of each string, whereas your manual solution has a map lookup for every nucleotide (but only one traversal of the string).

    Serious suggestion: Use a local variable for each counter. Those will almost surely get placed in CPU registers and are therefore essentially free. You won't ever pollute your cache with the counters that way, unlike map, unordered_map, vector etc.
    Replacing abstraction by repetition like this is usually not a good idea, but in this case, it's pretty inconceivable that you'll ever need significantly more counters, so scalability is not an issue.

  3. Consider std::string_view (which would require a different method of reading the file) to avoid creating copies of the data. You load the entire data into memory from disk and then, for each sequence, you copy it. That's not really necessary and (depending on how smart your compiler is) can bog you down. Especially since you keep appending to the string until the next header (which is more unnecessary copying - you could just count after every line).

  4. If, for some reason, you are not hitting the theoretical throughputs, consider multithreading and/or vectorization. But I can't imagine this would be necessary.
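The local-counter idea from point 2 might look something like the sketch below (the function name and signature are my own; counting per line as suggested in point 3, so no sequence accumulation is needed):

```cpp
#include <string>

// One pass over a line; five counters the compiler can keep in registers.
// Characters other than A/T/C/G/N (e.g. stray whitespace) are ignored.
void count_line(const std::string& line,
                unsigned long long& a, unsigned long long& t,
                unsigned long long& c, unsigned long long& g,
                unsigned long long& n) {
    for (char ch : line) {
        switch (ch) {
            case 'A': ++a; break;
            case 'T': ++t; break;
            case 'C': ++c; break;
            case 'G': ++g; break;
            case 'N': ++n; break;
            default: break;
        }
    }
}
```

Called once per non-header line, this touches each character exactly once and never pollutes the cache with counter storage.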

By the way, boost::count is a thin wrapper around std::count at least in this version.
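For illustration (the wrapper function name here is my own), the equivalent std::count call would be:

```cpp
#include <algorithm> // std::count
#include <string>

// boost::count(sequence, 'A') forwards to std::count over the whole range:
long count_nucleotide(const std::string& sequence, char nucleotide) {
    return std::count(sequence.begin(), sequence.end(), nucleotide);
}
```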

I think you did the right thing here though: Writing good and readable code, then identifying it as performance bottleneck and checking if you can make it run faster (potentially by making it slightly more ugly).

Upvotes: 1

Nelfeal

Reputation: 13269

Don't use a map if you want speed and can use an array. Also, std::getline can take a custom delimiter (instead of '\n').

ifstream sequence_file(input_file.c_str());
string header;
string sequence;
std::array<long long, 26> nucleotide_counts{}; // zero-initialised

// For one sequence: read the header line first,
// then read everything up to the next '>'
getline(sequence_file, header);         // consumes ">Header 1"
getline(sequence_file, sequence, '>');  // sequence lines, newlines included
for(auto&& c : sequence) {
    if(c >= 'A' && c <= 'Z')            // skip the embedded newlines
        ++nucleotide_counts[c-'A'];
}

// nucleotide_counts['X'-'A'] contains the count of nucleotide X in the sequence


Upvotes: 0

s.paszko

Reputation: 732

Like people in the comments suggested, try something like this:

enum eNucleotide {
    NucleotideA = 0,
    NucleotideT,
    NucleotideC,
    NucleotideG,
    NucleotideN,
    Size,
};

void countSequence(std::string line)
{
    long nucleotide_counts[eNucleotide::Size] = { 0 };

    if(line[0] != '>') {
        for(int i = 0; i < line.size(); ++i) 
        {
           switch (line[i])
           {
               case 'A':
                   ++nucleotide_counts[NucleotideA];
                   break;
               case 'T':
                   ++nucleotide_counts[NucleotideT];
                   break;                   
               case 'C':
                   ++nucleotide_counts[NucleotideC];
                   break;                   
               case 'G':
                   ++nucleotide_counts[NucleotideG];
                   break;                   
               case 'N':
                   ++nucleotide_counts[NucleotideN];
                   break;                   
               default : 
                   /// error condition
                   break;
           }     
        }

    /// print results
    std::cout << "A: " << nucleotide_counts[NucleotideA] << " ";
    std::cout << "T: " << nucleotide_counts[NucleotideT] << " ";
    std::cout << "C: " << nucleotide_counts[NucleotideC] << " ";
    std::cout << "G: " << nucleotide_counts[NucleotideG] << " ";
    std::cout << "N: " << nucleotide_counts[NucleotideN] << std::endl;
    }
}

and call this function for every line of content. (I haven't tested the code.)

Upvotes: 0

Matthieu Brucher

Reputation: 22023

The reason it's so slow is that you either have indirect accesses all the time or five scans of the same string.

You don't need a map: use 5 integers and increment them separately. This will be faster than the boost::count version because you don't traverse the string 5 times, and faster than the map or unordered_map increments because you avoid the indirect accesses.

so use something like:

switch(c) // c = current character
{
case 'A':
    ++a;
    break;
case 'G':
    ++g;
    break;
}
...

Upvotes: 0
