Reputation: 107
I have a very large collection of strings and want to find the subset of unique strings, so I am using the set container. The methods go out to a MySQL database, pull in a new group of strings, and try to add them to a set. I check the return value of insert to determine whether the string was added (first occurrence) or was already present.
#include <iostream>
#include <string>
#include <fstream>
#include <algorithm>
#include <vector>
#include <iostream>
#include "CDR3Sample.h"
#include "MySQLConnect.h"
using namespace std;

int main() {
    CDR3SetReturn ret;
    //CDR3Set is a typedef on set<string>
    CDR3Set total;
    try
    {
        MySQLConnect connection;
        cerr << "size of master " << connection.getMasterSize() << endl;
        SampleIDList list = connection.getSampleIDList();
        SampleIDList ids_seen;
        cerr << "size of raw ID list " << list.size() << endl;
        for (SampleIDListIterator it=list.begin(); it != list.end(); it++) {
            // We're going to skip it if the table doesn't exist or if the sample has already been processed
            if (connection.checkTable(*it) && find(ids_seen.begin(), ids_seen.end(), *it)!=list.end()) {
                CDR3Sample s(*it, connection);
                int valid_number = 0;
                for (CDR3SetIterator sit=s.begin(); sit != s.end(); sit++) {
                    ret = total.insert(*sit);
                    if (ret.second) {
                        valid_number++;
                    }
                }
                cout << *it << " " << s.getLength() << " " << valid_number << " " << total.size() << endl;
                ids_seen.push_back(*it);
            } else {
                cerr << *it << " table not found" << endl;
            }
        }
    }
    catch (int i)
    {
        // Need to put code here to save state of calculation
        std::cerr << "Exception thrown by MySQLConnect " << i << std::endl;
        exit(-1);
    }
    // Need to put code here to save state of calculation
    cerr << "size of total " << total.size() << endl;
    ofstream ofs ("cdr3_tally.test", ifstream::out);
    int it_count=0;
    while (ofs.good()) {
        for (CDR3SetIterator it=total.begin(); it != total.end(); ++it) {
            cout << it_count << " " << *it << endl;
            it_count++;
        }
    }
    ofs.close();
    cerr << "it_count " << it_count << endl;
    ofs_naive.close();
    return 0;
}
I'll leave the supporting code out for brevity, but I can provide it.
When it gets to the end, it has the correct number of entries:
size of master 9243
size of raw ID list 1
~MySQLConnect
size of total 372
But the loop that writes out the set just keeps going and going for millions of lines. If I use sort -u on the output, it has the correct number of entries.
I am stumped. The code looks OK to me. It's not that complicated.
Can anyone see something that I have done wrong? Should I make a formal class out of CDR3Set instead of a typedef?
I am using g++ on ubuntu
$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.1-2ubuntu1~12.04' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.1 (Ubuntu 4.8.1-2ubuntu1~12.04)
Thanks
Mike
Upvotes: 1
Views: 110
Reputation: 9383
Your cout for loop is enclosed in while (ofs.good()). Nothing inside the for loop ever touches ofs (the output goes to cout), so the stream never goes bad and the while loop keeps iterating over the set and printing everything again and again.
Upvotes: 2