Efficient way to find accumulated unique element count at each index in a vector in C++

Question

So I have a gene alignment result (I think it's from RNA-seq) in which specific sequences are matched to certain genes (those genes repeat themselves sometimes). I am now using C++ to find the earliest possible position at which most unique genes have been counted and I can confidently use only the first part for further analysis. The problem is that the file I have is sorted (so sequences from the same gene are put together), but I want to calculate that for non-sorted files.

What I am doing now is to manually std::shuffle my std::vector geneList before iterating through it. In each iteration I would compare every coming gene from the geneList to a unique gene list and update it if none within it matches that coming gene. Then I would sample count_gene and count_unique_gene with some interval and finally get the x% position. This is very costly...with only 140,000 genes costing me several minutes. Sample code (also included my input code for better understanding):

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

class COUNT{

public:
//    COUNT() : sampRate(1) {};
//    ~COUNT(){};
    void initialize();
    void readData();
    void calcCount();
    void writeCount();

private:
    int sampRate;    
    int countRead;
    int countUniqGene;
    string fname;
    vector geneList;
    vector sampCountRead, sampCountUniqGene;
};


void COUNT::readData(){
    std::cout<<"Type in fname: "<>fname;
    fname = "testdata.sam";
    std::cout<<"Type in sample rate: "<>sampRate;
    std::string line;
    std::fstream readFile(fname);
    while (getline(readFile, line)) {
        if (((char)line.back() == '-') | ((char)line.back() == '+')) {
            std::istringstream ss(line);
            std::string others, oriGeneName, realGeneName;
            ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others;
            if (ss >> others) {
                ss >> oriGeneName;
                realGeneName = oriGeneName.substr(5, oriGeneName.size());
                geneList.push_back(realGeneName);
            }
            else continue;
        }
        else continue;
    }
    readFile.close();
}

void COUNT::calcCount() {
    set uniqGeneList = {"test"};
    auto rng = std::default_random_engine{};
    std::shuffle(std::begin(geneList), std::end(geneList), rng);  // this is fast (0.02s)
    vector::iterator uniqIter;
    string geneName;
    for (vector::iterator iter = geneList.begin(); iter != geneList.end(); ++iter) {  //this is very slow
        geneName = string(iter[0]);
        if (all_of(uniqGeneList.begin(), uniqGeneList.end(), [geneName](const std::string gene) {return geneName != gene; })) {
            uniqGeneList.insert(geneName);
            countUniqGene++;
        }
        countRead++;
        if (countRead % sampRate < 1) {
            sampCountRead.push_back(countRead);
            sampCountUniqGene.push_back(countUniqGene);
        }
    }
}

int main(){
    (not important, plotting and intersecting)
}

Also sample data (I extracted genename from the 'GE:Z:' item):

V300067289_HH26L1C001R0010008289    784 7   141178046   3   2S48M   *   0   0   CCCCACCTGCTTGCGGACCCTAATGTGACGTTGGCGGATGAGCACACGGG  F)BF;E2A3*F<+AFFB-B,FE?FEFFF@EF3BFB;<:FECEF2DFF@CE  NH:i:2  HI:i:1  AS:i:47 nM:i:0  CB:Z:53_34_81098_51183  UR:Z:GTTTTATTA  UY:Z:E/E@EAG?F
V300067289_HH26L1C001R0010008294    1040    3   34078775    255 50M *   0   0   CCTTGTCTGGGTGATTTAATAGCATAATCCGGCGATGAGCATCCCTGATC  FGFBGFFFDEGEBFEFBDFEEFFGCFFFADEEFGFFFFGFGFFDFFEEFF  NH:i:1  HI:i:1  AS:i:49 nM:i:0  CB:Z:49_31_75043_46832  UR:Z:CCGGACCCA  UY:Z:EEEFDBFEA  XF:Z:CODING GE:Z:Dnajc19    GS:Z:-
V300067289_HH26L1C001R0010008295    1040    3   34078777    255 2S48M   *   0   0   CGTTGTCTGGGTGATTTAATACCAAAATCCGGCGATGAGCATCCCTGATC  E(FC6D>+EDEFECA?@?:C.'D4&03@&;:3?7CFAEF@C7A4.4.@B?  NH:i:1  HI:i:1  AS:i:43 nM:i:2  CB:Z:49_31_75043_46832  UR:Z:CCGGACCCA  UY:Z:CFFD?EEDF  XF:Z:CODING GE:Z:Dnajc19    GS:Z:-
V300067289_HH26L1C001R0010008298    1040    15  82351046    255 50M *   0   0   ACTTTATCCCGTCCTTGTTTCACCGTGATATCCAGCTGCATTAAGTGCAC  )EFFF=FE?DC=FABGFFF7F&=FEF9FBEE=BEB9FFF;FFCF9DBF86  NH:i:1  HI:i:1  AS:i:49 nM:i:0  CB:Z:49_34_74093_51329  UR:Z:CAATATAGG  UY:Z:DFFFFFFFF  XF:Z:CODING GE:Z:Ndufa6 GS:Z:-

I have thought of assuming gene appearance to follow a Poisson distribution and just count the unique gene number and calculate the confidence level of attaining x% at each position. But better use a simulation first. Thanks in advance!

Sam Varshavchik · Accepted Answer

set uniqGeneList = {"test"};

The whole purpose of a std::set is to be able to quickly (as in, logarithmic complexity) check for a presence of a specific value, whether the value exists in the set.

    if (all_of(uniqGeneList.begin(), uniqGeneList.end(),
        [geneName](const std::string gene) {return geneName != gene; })) {

So, rather than letting std::set do precisely what the sole purpose of its existence is, the sole reason it was invented for, this ends up iterating, one by one, manually, over every value in the set, and manually comparing it? As Mr. Spock would say: this is not logical.

This whole thing must be replaced, simply, by:

    if (uniqGeneList.find(geneName) == uniqGeneList.end()) {

A ballpark estimate: with a thousand or so values in the set this ends up replacing a thousand comparisons with eight. Some further performance gains may also be realized by replacing the std::set with std::unordered_set. You will need to profile this, to see if this will be the result in your use case.

A few other things worthy of note:

        std::istringstream ss(line);
        std::string others, oriGeneName, realGeneName;
        ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others; ss >> others;

Streams have a reputation for being horribly inefficient. This is also likely another part that sacrifices too many electrons. In performance sensitive situations, ditching the std::istringstream, rolling up one's sleeves, and iterating over the line, counting words as they whiz by, will also likely squeeze out some performance.

Efficient way to find accumulated unique element count at each index in a vector in C++

Answers (1)

Related Questions