user3096277

Reputation: 107

C++ set outputs many more elements than it contains

I have a very large collection of strings, and I want to find the subset of unique strings using the set container. The code goes out to a MySQL database, pulls in a new group of strings, and tries to add each one to a set. I check the return value of insert to determine whether the string was added (first occurrence) or was already present.

#include <iostream>
#include <string>
#include <fstream>
#include <algorithm>
#include <vector>
#include <cstdlib>

#include "CDR3Sample.h"
#include "MySQLConnect.h"

using namespace std;

int main() {

    CDR3SetReturn ret;
    //CDR3Set is a typedef on set<string>
    CDR3Set total;

    try
    {
            MySQLConnect connection;
            cerr << "size of master " << connection.getMasterSize() << endl;

            SampleIDList list = connection.getSampleIDList();
            SampleIDList ids_seen;
            cerr << "size of raw ID list " << list.size() << endl;


            for (SampleIDListIterator it=list.begin(); it != list.end(); it++) {
                // Skip the sample if its table doesn't exist or if it has already been processed
                if (connection.checkTable(*it) && find(ids_seen.begin(), ids_seen.end(), *it) == ids_seen.end()) {
                    CDR3Sample s(*it, connection);
                    int valid_number = 0;
                    for (CDR3SetIterator sit=s.begin(); sit != s.end(); sit++) {
                        ret = total.insert(*sit);
                        // ret.second is true only when the string was newly inserted
                        if (ret.second) {
                            valid_number++;
                        }
                    }
                    cout << *it << " " << s.getLength() << " " << valid_number << " " << total.size() << endl;
                    ids_seen.push_back(*it);
                } else {
                    cerr << *it << " table not found" << endl;
                }
            }
    }
    catch (int i)
    {
            // Need to put code here to save state of calculation
        std::cerr << "Exception thrown by MySQLConnect " << i << std::endl;

        exit(-1);
    }

    // Need to put code here to save state of calculation
    cerr << "size of total " << total.size() << endl;
    ofstream ofs ("cdr3_tally.test", ofstream::out);
    int it_count=0;
    while (ofs.good()) {
        for (CDR3SetIterator it=total.begin(); it != total.end(); ++it) {
            cout << it_count << " " << *it  << endl;
            it_count++;
        }
    }
    ofs.close();
    cerr << "it_count " << it_count << endl;

    return 0;
}

I'll leave the supporting code out for brevity, but I can provide it.

When it gets to the end, it has the correct number of entries:

size of master 9243
size of raw ID list 1
~MySQLConnect
size of total 372

But the loop that writes out the set just keeps going and going for millions of lines. If I use sort -u on the output, it has the correct number of entries.

I am stumped. The code looks OK to me. It's not that complicated.

Can anyone see something that I have done wrong? Should I make a formal class out of CDR3Set instead of a typedef?

I am using g++ on Ubuntu:

$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.1-2ubuntu1~12.04' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.1 (Ubuntu 4.8.1-2ubuntu1~12.04)

Thanks

Mike

Upvotes: 1

Views: 110

Answers (1)

dlf

Reputation: 9383

The for loop that prints to cout is enclosed in while (ofs.good()). Nothing inside the for loop ever writes to ofs, so the stream never goes bad and the while loop never exits: it keeps iterating over the set and printing everything again and again.

Upvotes: 2
