Reputation: 39
I have two files, "test" & "sample". each file contains "rs-numbers" followed by "genotypes".
The test file is smaller than the sample file. only have around 150 rs-numbers+their genotypes.
However, the sample file contains more than 900k rs-numbers+their genotypes.
readTest() opens "test.tsv", read through the file by line, and returns a vector of tuples. the tuple holds (rs-number, geno type).
analyze() takes the result from readTest(), opens the sample file, read the file by line, and then do the comparison.
example from the sample file:
rs12124811\t1\t776546\tAA\r\n
rs11240777\t1\t798959\tGG\r\n
rs12124811 and rs11240777 are the rs-numbers. AA and GG are their genotypes.
the running time would be n*m. In my c++ version of this program, it takes 30 seconds while python version only takes 15 seconds and 5 seconds with multiprocessing.
vector<tuple<QString, QString>> readTest(string test){
// readTest can be done instantly
return tar_gene;
}
// tar_gene is the result from readTest()
// now call analyze. it reads through the sample.txt by line and does
// comparison.
QString analyze(string sample_name,
vector<tuple<QString, QString>> tar_gene
){
QString data_matches;
QFile file(QString::fromStdString(sample_name));
file.open(QIODevice::ReadOnly);
//skip first 20 lines
for(int i= 0; i < 20; i++){
file.readLine();
}
while(!file.atEnd()){ // O(m)
const QByteArray line = file.readLine();
const QList<QByteArray> tokens = line.split('\t');
// tar_gene is the result from readTest()
for (auto i: tar_gene){ // O(n*m)
// check if two rs-numbers are matched
if (get<0>(i) == tokens[0]){
QString i_rs = get<0>(i);
QString i_geno = get<1>(i);
QByteArray cur_geno = tokens[3].split('\r')[0];
// check if their genotypes are matched
if(cur_geno.length() == 2){
if (i_geno == cur_geno.at(0) || i_geno == cur_geno.at(1)){
data_matches += i_rs + '-' + i_geno + '\n';
break; // rs-numbers are unique. we can safely break
// the for loop
}
}
// check if their genotypes are matched
else if (cur_geno.length() == 1) {
if (i_geno == cur_geno.at(0)){
data_matches += i_rs + '-' + i_geno + '\n';
break; // rs-numbers are unique. we can safely break
// the for loop
}
}
}
}
}
return data_matches; // QString data_matches will be used in main() and
// printed out in text browser
}
Here is the full source code
#include "mainwindow.h"
#include "ui_mainwindow.h"
MainWindow::MainWindow(QWidget *parent) :
QMainWindow(parent),
ui(new Ui::MainWindow)
{
ui->setupUi(this);
}
MainWindow::~MainWindow()
{
delete ui;
}
QString analyze(string sample_name,
vector<tuple<QString, QString>> tar_gene,
int start, int end){
QString rs_matches, data_matches;
QFile file(QString::fromStdString(sample_name));
file.open(QIODevice::ReadOnly);
//skip first 20 lines
for(int i= 0; i < 20; i++){
file.readLine();
}
while(!file.atEnd()){
const QByteArray line = file.readLine();
const QList<QByteArray> tokens = line.split('\t');
for (auto i: tar_gene){
if (get<0>(i) == tokens[0]){
QString i_rs = get<0>(i);
QString i_geno = get<1>(i);
QByteArray cur_geno = tokens[3].split('\r')[0];
if(cur_geno.length() == 2){
if (i_geno == cur_geno.at(0) || i_geno == cur_geno.at(1)){
data_matches += i_rs + '-' + i_geno + '\n';
break;
}
}
else if (cur_geno.length() == 1) {
if (i_geno == cur_geno.at(0)){
data_matches += i_rs + '-' + i_geno + '\n';
break;
}
}
}
}
}
return data_matches;
}
vector<tuple<QString, QString>> readTest(string test){
vector<tuple<QString, QString>> tar_gene;
QFile file(QString::fromStdString(test));
file.open(QIODevice::ReadOnly);
file.readLine(); // skip first line
while(!file.atEnd()){
QString line = file.readLine();
QStringList templist;
templist.append(line.split('\t')[20].split('-'));
tar_gene.push_back(make_tuple(templist.at(0),
templist.at(1)));
}
return tar_gene;
}
void MainWindow::on_pushButton_analyze_clicked()
{
if(ui->comboBox_sample->currentIndex() == 0){
ui->textBrowser_rs->setText("Select a sample.");
return;
}
if(ui->comboBox_test->currentIndex() == 0){
ui->textBrowser_rs->setText("Select a test.");
return;
}
string sample = (ui->comboBox_sample->currentText().toStdString()) + ".txt";
string test = ui->comboBox_test->currentText().toStdString() + ".tsv";
vector<tuple<QString, QString>> tar_gene;
QFile file_test(QString::fromStdString(test));
if (!file_test.exists()) {
ui->textBrowser_rs->setText("The test file doesn't exist.");
return;
}
tar_gene = readTest(test);
QFile file_sample(QString::fromStdString(sample));
if (!file_sample.exists()) {
ui->textBrowser_rs->setText("The sample file doesn't exist.");
return;
}
clock_t t1,t2;
t1=clock();
QString result = analyze(sample, tar_gene, 0, 0);
t2=clock();
float diff ((float)t2-(float)t1);
float seconds = diff / CLOCKS_PER_SEC;
qDebug() << seconds;
ui->textBrowser_rsgeno->setText(result);
}
How do I make it run faster? I remade my program in c++ because I expected to see a better performance than python version!
with @Felix helps, my program takes 15 seconds now. I will try multithreading later.
here are the examples of the source data files:
(test.tsv) rs17760268-C rs10439884-A rs4911642-C rs157640-G ... and more. They are not sorted.
(sample.txt) rs12124811\t1\t776546\tAA\r\n rs11240777\t1\t798959\tGG\r\n ... and more. They are not sorted.
any recommendation for better data structure or algorithms?
Upvotes: 0
Views: 410
Reputation: 7146
There are actually a bunch of things you can do to optimize this code.
for(const auto &i : tar_gene)
. There are many more instances. Basically, try to avoid anything that is not a reference as much as possible. This also means to use std::move
and rvalue references wherever possible.DEFINES += QT_USE_QSTRINGBUILDER
to your pro file.QString
or QByteArray
for your whole code. Mixing them means that Qt has to convert them everytime you compare them.These are just the most basic and simplest things you can do. Try them and see how much speed you can gain. If it's still not enough, you can either try to further optimize the mathematical algorithm you are implementing here or look deeper into C/C++ to learn all the small tricks you can do for more speed.
EDIT: You can also try to gain speed by splitting the code up on multiple threads. Doing this by hand is not a good idea - have a look at QtConcurrent
if you are interested in that.
Upvotes: 2