vector07
vector07

Reputation: 359

segmentation fault in unordered_map after ~ 10000 keys added

Segmentation fault occurs when datact=10736 when trying to insert into an unordered_map (see commented line for which call throws the fault). See below for attempted fixes.

When SIGSEGV is thrown it points me to line 764 of hashtable_policy.h

INPUT: data file with column1=counts, column2=16-character strings

PURPOSE: cluster 16-character sequences by adding all the counts of 1-substitution different sequences together. First sequence seen is "origin" by which all its 1-substitution friends are identified.

PSEUDOCODE: for each line in file:

  1. read count, read sequence.

  2. if sequence key_value exists in hash 'location' (type unordered_map), add current count;

  3. otherwise make a new key_value, have it point to the count here, and assign all 1-substitution sequences to also point to this count.

Code:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cmath>
#include <vector>
#include <unordered_map>
#include <map>

#include "matrix.h"

using namespace std;

int nuc2num(char currchar)
{   // returns 0,1,2,3 for A,C,T,G respectively
    int outnum;

    if (currchar=='A')
    {
        outnum=0;
    }
    else if (currchar=='C')
    {
        outnum=1;
    }
    else if (currchar=='T')
    {
        outnum=2;
    }
    else
    {
        outnum=3;
    }
    return outnum;
}

int main(int argc, char* argv[])
{
    //command line arguments
    //  arg1=filename, arg2=barcode sequence length, arg3=#mismatches permitted

    //input handling
    //  file format: column 1 | column 2
    //               counts   | sequence    [int | string]

    string filename;
    string funnelstring;

    // define lookup matrix; rows=ACTG, cols = ACTG without row element
    Matrix <char> sub_lookup(4,3);
    sub_lookup[0][0] = 'C';
    sub_lookup[0][1] = 'T';
    sub_lookup[0][2] = 'G';
    sub_lookup[1][0] = 'A';
    sub_lookup[1][1] = 'T';
    sub_lookup[1][2] = 'G';
    sub_lookup[2][0] = 'A';
    sub_lookup[2][1] = 'C';
    sub_lookup[2][2] = 'G';
    sub_lookup[3][0] = 'A';
    sub_lookup[3][1] = 'C';
    sub_lookup[3][2] = 'T';

    int L,k;

    int j=0;

    const int buffersize=10000;
    int currentsize=buffersize;

    int datact=0;
    int currchar;

    vector <unsigned int> ctarr(buffersize);
    vector <string> seqarr(buffersize);

    filename=argv[1];
    L=atoi(argv[2]);
    k=atoi(argv[3]);

    unsigned int sct;
    int substrlen;
    string sequence,textct;
    ifstream seqfile (filename.c_str());

    //map <string,unsigned int*> location;
    unordered_map <string,unsigned int*> location;

    if (seqfile.is_open())
    {
        getline(seqfile,textct,'\n');
        while (textct != "")
        {
            sct=atoi(textct.c_str());
            substrlen=textct.length();
            //cout << textct << endl;
            sequence=textct.substr(substrlen-L,L);
            //cout << sequence << endl;

            //is there an associated guy?
            if (location.find(sequence) != location.end()) //just asks whether this key has been assigned
            {   //there's a value in the region
                *location[sequence]+=sct;
            }
            else
            {   //no value in region, make a footprint
                ctarr[datact]=sct;
                seqarr[datact]=sequence;
                location[sequence]=&ctarr[datact]; //assign current key to point to data count


                //assign k substitution "funnel" region to point to this count as well
                for (j=0; j<L; j++)
                {
                    funnelstring=sequence;
                    currchar = nuc2num(sequence[j]);

                    if (datact==10736 && j==13)
                    {
                        cout << "here" << endl;
                        cout << sequence << endl;
                    }

                    for (k=0; k<3; k++)
                    {
                        funnelstring[j]=sub_lookup[currchar][k];

//                    if (datact==10736 && j==13)
//                    {
//                        cout << funnelstring << endl;
//                        cout << location.max_size() << " | " << location.size() << endl;
//                        string asdf;
//                        asdf="AAAAAAAAAAAAAAAA";
//                        location[asdf]=&ctarr[datact]; //still segfaults
//                    }

                        if (location.find(funnelstring) == location.end()) // asks whether this key has been assigned
                        {   //this region is not assigned to another funnel
                            location[funnelstring]=&ctarr[datact]; //LINE THAT CAUSES SIGSEGV
                        }
                    }
                }
                datact++;
                cout << datact << endl;
                if (datact>=currentsize)
                {
                    ctarr.resize(currentsize+buffersize);
                    seqarr.resize(currentsize+buffersize);
                    currentsize+=buffersize;
                }
            }

            getline(seqfile,textct,'\n');
         }
        seqfile.close();
    }

Explorations.

  1. It doesn't matter what key is added, when datact==10736 and j=13, any key added to (unordered_map) location results in a SIGSEGV.
  2. The line of code in question (marked by a comment above) is called many times prior, and operates correctly.
  3. Exchanging unordered_map for map results in the same event, but at a different (higher) count of datact. Still low (between datact==16-35 thousand).
  4. Rewriting this code almost exactly, but with randomly generated 16-character sequences works perfectly, as far as I can tell (no segfaults for datact up to 200,000, didnt test higher).
  5. In the code in (4), it does appear that there is a rehash right around 10000, which could be related or coincidence.

If need be, I can post the datafile that is read.

EDIT: Solved Instead of unordered_map <string,unsigned int*> location, replaced with unordered_map <string,unsigned int> location (value_type is int instead of int*). Now value_type holds the index in the ctarr[]. Runs fine. Thanks!

Upvotes: 1

Views: 3608

Answers (1)

us2012
us2012

Reputation: 16253

Pointers to vector elements can be invalidated when you call vector::resize(). This is because the entire data may have to be moved around in order to find a contiguous memory block that fits the new size. In other words: As soon as you call resize, all your location data suddenly becomes useless garbage.

Possible solutions:

  • Let location store the index of the required element in ctarr as its value rather than a pointer. (This will certainly not change the semantics of your program.)
  • Let location store the actual unsigned int value instead of a pointer. Depending on your program logic and how you change and access this data, this may not be what you want.

Also note that although the segfault occurs in hashtable_policy.h, this bug has nothing to do with the implementation of unordered_map (or vector) - it is entirely your fault for not reading the reference for vector::resize() ;-) : http://www.cplusplus.com/reference/vector/vector/resize/ (section 'Iterator validity')


The other thing I noticed about your code is that you use operator[] for accessing your vector elements. This disables out-of-bounds checking. If I came across an error like yours in my code (hard to trace back because it occurs somewhere far from my erroneous code), my first course of action would be swapping operator[] for vector::at() (actually, I always start with at() and only switch if I can prove beyond reasonable doubt that bounds checking is a performance bottleneck for this particular purpose). This would not have helped with your problem, but is often an invaluable help in spotting mistakes.

Upvotes: 6

Related Questions