Reputation: 43
First, a bit of background about my question.
I work as a bioinformatician, which means that I apply computational methods to answer biological questions. In my problem, I have to manipulate a file format called FASTA, which looks like this:
>Header 1
ATGACTGATCGNTGACTGACTGTAGCTAGC
>Header 2
ATGCATGCTAGCTGACTGATCGTAGCTAGC
ATCGATCGTAGCT
So a FASTA file is basically just a header prefixed with a '>' character, followed by a sequence, on one or multiple lines, composed of nucleotides. Nucleotides are characters that can take 5 possible values: A, T, C, G or N.
The thing I would like to do is count the number of times each nucleotide type appears, so if we consider this dummy FASTA file:
>Header 1
ATTCGN
I should have, as a result:
A:1 T:2 C:1 G:1 N:1
Here is what I have so far:
ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;
while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        nucleotide_counts['A'] = boost::count(sequence, 'A');
        nucleotide_counts['T'] = boost::count(sequence, 'T');
        nucleotide_counts['C'] = boost::count(sequence, 'C');
        nucleotide_counts['G'] = boost::count(sequence, 'G');
        nucleotide_counts['N'] = boost::count(sequence, 'N');
        sequence = "";
    }
}
// note: the last sequence of the file is never counted here, since no
// further '>' line follows it
So it reads the file line by line; if it encounters a '>' as the first character of a line, it knows that the current sequence is complete and starts counting. The problem I'm facing is that I have millions of sequences, with several billion nucleotides in total, and I can see that my method is not optimized because I call boost::count five times on the same sequence.
Other things I have tried:
- A map<char, double> to map each nucleotide to a value and increment it while traversing the sequence, but this was slower than the boost solution.
- std::count from the algorithm library, but this was too slow too.

I searched the internet for solutions, but every solution I found was only good if the number of sequences was low, which is not my case. Would you have any idea that could help me speed things up?
EDIT 1: I also tried this version, but it was twice as slow as the boost one:
ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;
while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        for(int i = 0; i < sequence.size(); i++) {
            nucleotide_counts[sequence[i]]++;
        }
        sequence = "";
    }
}
EDIT 2: Thanks to everyone in this thread, I was able to obtain a speed-up of about 30 times compared to the original boost solution. Here is the code:
#include <array>   // std::array
#include <fstream> // std::ifstream
#include <string>  // std::string

void count_nucleotides(std::array<double, 26> &nucleotide_counts, const std::string &sequence) {
    for(unsigned int i = 0; i < sequence.size(); i++) {
        ++nucleotide_counts[sequence[i] - 'A'];
    }
}

std::ifstream sequence_file(input_file.c_str());
std::string line;
std::string sequence = "";
std::array<double, 26> nucleotide_counts{}; // value-initialize so all counts start at 0
while(std::getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        count_nucleotides(nucleotide_counts, sequence);
        sequence = "";
    }
}
count_nucleotides(nucleotide_counts, sequence); // don't forget the last sequence of the file
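For reference, a minimal sketch (not part of the original solution; it assumes #include <iostream>) of printing the five counts of interest back out of the 26-slot array:

const char bases[] = {'A', 'T', 'C', 'G', 'N'};
for(char c : bases) {
    std::cout << c << ":" << nucleotide_counts[c - 'A'] << " ";
}
std::cout << std::endl;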
Upvotes: 2
Views: 1041
Reputation: 26481
If this is the main task you have to perform, you might be interested in an awk solution, since various problems with FASTA files are very easily tackled with awk:
awk '/^>/ && c { for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print ""; delete a }
     /^>/ { print; c++; next }
     { for(i=1;i<=length($0);++i) a[substr($0,i,1)]++ }
     END { for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" }' fastafile
This outputs on your example:
>Header 1
N:1 A:7 C:6 G:8 T:8
>Header 2
A:10 C:10 G:11 T:12
Note: I am aware that this is not C++, but it is often useful to show other means of achieving the same goal.
Benchmarks with awk:
Script 0: (runtime: too long) The first script mentioned above is utterly slow. Use it only on small files.
Script 1: (runtime: 484.31 sec) This is an optimised version where we do a targeted count:
/^>/ && f { for(i in c) printf i":"c[i]" "; print "" ; delete c }
/^>/ {print; f++; next}
{
    s=$0
    c["A"]+=gsub(/[aA]/,"",s)
    c["C"]+=gsub(/[cC]/,"",s)
    c["G"]+=gsub(/[gG]/,"",s)
    c["T"]+=gsub(/[tT]/,"",s)
    c["N"]+=gsub(/[nN]/,"",s)
}
END { for(i in c) printf i":"c[i]" "; print "" ; delete c }
Script 2: (runtime: 416.43 sec) Combine all the subsequences into a single sequence and count only once:
function count() {
    c["A"]+=gsub(/[aA]/,"",string)
    c["C"]+=gsub(/[cC]/,"",string)
    c["G"]+=gsub(/[gG]/,"",string)
    c["T"]+=gsub(/[tT]/,"",string)
    c["N"]+=gsub(/[nN]/,"",string)
}
/^>/ && f { count(); for(i in c) printf i":"c[i]" "; print "" ; delete c; string=""}
/^>/ {print; f++; next}
{ string=string $0 }
END { count(); for(i in c) printf i":"c[i]" "; print "" }
Script 3: (runtime: 396.12 sec) Refine how awk splits records and fields, and exploit this to process each sequence in a single go:
function count() {
    c["A"]+=gsub(/[aA]/,"",string)
    c["C"]+=gsub(/[cC]/,"",string)
    c["G"]+=gsub(/[gG]/,"",string)
    c["T"]+=gsub(/[tT]/,"",string)
    c["N"]+=gsub(/[nN]/,"",string)
}
BEGIN{RS="\n>"; FS="\n"}
{
    print $1
    string=substr($0,length($1)+1); count()
    for(i in c) printf i":"c[i]" "; print ""
    delete c; string=""
}
Script 4: (runtime: 259.69 sec) Update the regex used in gsub. This creates a worthwhile speedup:
function count() {
    n=length(string)
    gsub(/[aA]+/,"",string); m=length(string); c["A"]+=n-m; n=m
    gsub(/[cC]+/,"",string); m=length(string); c["C"]+=n-m; n=m
    gsub(/[gG]+/,"",string); m=length(string); c["G"]+=n-m; n=m
    gsub(/[tT]+/,"",string); m=length(string); c["T"]+=n-m; n=m
    gsub(/[nN]+/,"",string); m=length(string); c["N"]+=n-m; n=m
}
BEGIN{RS="\n>"; FS="\n"}
{
    print ">"$1
    string=substr($0,length($1)+1); count()
    for(i in c) printf i":"c[i]" "; print ""
    delete c; string=""
}
Upvotes: 2
Reputation: 23701
In order of importance:
Good code for this task will be 100% I/O-bound. Your processor can count characters much faster than your disk can pump them to the CPU. Thus, the first question to me is: What is the throughput of your storage medium? What are your ideal RAM and cache throughputs? Those are the upper limits. If you've hit them, there's not much point in looking at your code further. It's possible that your boost solution is there already.
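As a rough way of finding that ceiling, here is a hedged sketch (mine, not part of the answer; the file name and buffer size are arbitrary) that times a raw sequential read of the file:

#include <chrono>   // std::chrono::steady_clock
#include <fstream>  // std::ifstream
#include <iostream> // std::cout
#include <vector>   // std::vector

int main() {
    std::ifstream in("sequences.fasta", std::ios::binary); // hypothetical file name
    std::vector<char> buffer(1 << 20);                     // 1 MiB read buffer
    std::size_t total = 0;
    auto start = std::chrono::steady_clock::now();
    while(in.read(buffer.data(), buffer.size()) || in.gcount() > 0)
        total += static_cast<std::size_t>(in.gcount());
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << (total / 1e6) / elapsed.count() << " MB/s\n";
}

If the counting program already processes data at roughly this rate, further micro-optimization of the counting loop won't buy you anything.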
std::map lookups are relatively expensive. Yes, it's O(log(N)), but your N = 5 is small and constant, so this tells you nothing. For 5 values, the map will have to chase about three pointers for every lookup (not to mention how impossible this is for the branch predictor). Your count solution has 5 map lookups and 5 traversals of each string, whereas your manual solution has a map lookup for every nucleotide (but only one traversal of the string).
Serious suggestion: Use a local variable for each counter. Those will almost surely get placed in CPU registers and are therefore essentially free. You won't ever pollute your cache with the counters that way, unlike map, unordered_map, vector, etc.
Replacing abstraction by repetition like this is usually not a good idea, but in this case, it's pretty inconceivable that you'll ever need significantly more counters, so scalability is not an issue.
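A minimal sketch of that suggestion (my names, not the poster's; sequence is the string from the question, and <cstdint> is assumed):

std::uint64_t a = 0, t = 0, c = 0, g = 0, n = 0; // five register-friendly locals
for(char ch : sequence) {
    switch(ch) {
        case 'A': ++a; break;
        case 'T': ++t; break;
        case 'C': ++c; break;
        case 'G': ++g; break;
        case 'N': ++n; break;
        default: break; // ignore anything unexpected
    }
}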
Consider std::string_view (which would require a different method of reading the file) to avoid creating copies of the data. You load the entire data into memory from disk and then, for each sequence, you copy it. That's not really necessary and (depending on how smart your compiler is) can bog you down. Especially since you keep appending to the string until the next header (which is more unnecessary copying - you could just count after every line, as sketched below).
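To make that last point concrete, here is a sketch (mine, reusing the names from the question; needs <array> and <cstdint>) that counts after every line instead of accumulating a sequence string first:

std::array<std::uint64_t, 26> counts{}; // zero-initialized
std::string line;
while(std::getline(sequence_file, line)) {
    if(line.empty() || line[0] == '>')
        continue; // nothing to count on header lines
    for(char ch : line)
        ++counts[ch - 'A']; // assumes uppercase A..Z input, as in the question
}

If you need per-sequence totals rather than a grand total, report and reset the counts whenever a header line is encountered.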
If, for some reason, you are not hitting the theoretical throughputs, consider multithreading and/or vectorization. But I can't imagine this would be necessary.
By the way, boost::count is a thin wrapper around std::count, at least in this version.
I think you did the right thing here though: writing good and readable code first, then identifying it as the performance bottleneck and checking whether you can make it run faster (potentially by making it slightly uglier).
Upvotes: 1
Reputation: 13269
Don't use a map if you want speed and can use an array. Also, std::getline can use a custom delimiter (instead of '\n'):
std::ifstream sequence_file(input_file.c_str());
std::string sequence;
std::array<int, 26> nucleotide_counts{}; // value-initialize so all counts start at 0

// For one sequence: read everything up to the next '>' delimiter
std::getline(sequence_file, sequence, '>');
// Skip the header line, then count; ignore '\n' and anything else outside 'A'..'Z'
for(std::size_t i = sequence.find('\n') + 1; i < sequence.size(); ++i) {
    char c = sequence[i];
    if(c >= 'A' && c <= 'Z')
        ++nucleotide_counts[c - 'A'];
}
// nucleotide_counts['X' - 'A'] contains the count of nucleotide X in the sequence
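Extended to a whole file, the same idea might look like this (my loop, not part of the original answer; it relies on the '>' delimiter chopping the stream into one record per sequence):

while(std::getline(sequence_file, sequence, '>')) {
    nucleotide_counts.fill(0); // fresh counts for each record
    // sequence.find('\n') locates the end of the header line; +1 skips past it
    for(std::size_t i = sequence.find('\n') + 1; i < sequence.size(); ++i) {
        char c = sequence[i];
        if(c >= 'A' && c <= 'Z')
            ++nucleotide_counts[c - 'A'];
    }
    // report nucleotide_counts for this record here
}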
Upvotes: 0
Reputation: 732
As people in the comments suggested, try something like this:
enum eNucleotide {
    NucleotideA = 0,
    NucleotideT,
    NucleotideC,
    NucleotideG,
    NucleotideN,
    Size,
};

void countSequence(std::string line)
{
    long nucleotide_counts[eNucleotide::Size] = { 0 };
    if(line[0] != '>') {
        for(std::size_t i = 0; i < line.size(); ++i)
        {
            switch (line[i])
            {
                case 'A':
                    ++nucleotide_counts[NucleotideA];
                    break;
                case 'T':
                    ++nucleotide_counts[NucleotideT];
                    break;
                case 'C':
                    ++nucleotide_counts[NucleotideC];
                    break;
                case 'G':
                    ++nucleotide_counts[NucleotideG];
                    break;
                case 'N':
                    ++nucleotide_counts[NucleotideN];
                    break;
                default:
                    /// error condition
                    break;
            }
        }
        /// print results
        std::cout << "A: " << nucleotide_counts[NucleotideA];
        std::cout << " T: " << nucleotide_counts[NucleotideT];
        std::cout << " C: " << nucleotide_counts[NucleotideC];
        std::cout << " G: " << nucleotide_counts[NucleotideG];
        std::cout << " N: " << nucleotide_counts[NucleotideN] << std::endl;
    }
}
and call this function for every line of content. (I didn't test the code.)
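A short usage sketch under the answer's own assumptions (equally untested; input_file is the variable from the question):

std::ifstream sequence_file(input_file.c_str());
std::string line;
while(std::getline(sequence_file, line)) {
    if(!line.empty())
        countSequence(line); // prints one set of counts per non-header line
}

Note that the counters live inside countSequence, so a sequence spread over several lines is reported line by line rather than as a single total.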
Upvotes: 0
Reputation: 22023
The reason it's so slow is that you either have indirect accesses all the time or scan the same string 5 times.
You don't need a map: use 5 integers and increment them separately. Then it should be faster than the boost::count version, because you don't traverse the string 5 times, and faster than the map or unordered_map increments, because you won't have n indirect accesses.
So use something like:
switch(c) // c is the current character of the sequence
{
    case 'A':
        ++a;
        break;
    case 'G':
        ++g;
        break;
}
...
Upvotes: 0