Reputation: 359
Segmentation fault occurs when datact=10736 when trying to insert into an unordered_map (see commented line for which call throws the fault). See below for attempted fixes.
When SIGSEGV is thrown it points me to line 764 of hashtable_policy.h
INPUT: data file with column1=counts, column2=16-character strings
PURPOSE: cluster 16-character sequences by adding all the counts of 1-substitution different sequences together. First sequence seen is "origin" by which all its 1-substitution friends are identified.
PSEUDOCODE: for each line in file:
read count, read sequence.
if sequence key_value exists in hash 'location' (type unordered_map), add current count;
otherwise make a new key_value, have it point to the count here, and assign all 1-substitution sequences to also point to this count.
Code:
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cmath>
#include <vector>
#include <unordered_map>
#include <map>
#include "matrix.h"
using namespace std;
int nuc2num(char currchar)
{ // returns 0,1,2,3 for A,C,T,G respectively
int outnum;
if (currchar=='A')
{
outnum=0;
}
else if (currchar=='C')
{
outnum=1;
}
else if (currchar=='T')
{
outnum=2;
}
else
{
outnum=3;
}
return outnum;
}
int main(int argc, char* argv[])
{
//command line arguments
// arg1=filename, arg2=barcode sequence length, arg3=#mismatches permitted
//input handling
// file format: column 1 | column 2
// counts | sequence [int | string]
string filename;
string funnelstring;
// define lookup matrix; rows=ACTG, cols = ACTG without row element
Matrix <char> sub_lookup(4,3);
sub_lookup[0][0] = 'C';
sub_lookup[0][1] = 'T';
sub_lookup[0][2] = 'G';
sub_lookup[1][0] = 'A';
sub_lookup[1][1] = 'T';
sub_lookup[1][2] = 'G';
sub_lookup[2][0] = 'A';
sub_lookup[2][1] = 'C';
sub_lookup[2][2] = 'G';
sub_lookup[3][0] = 'A';
sub_lookup[3][1] = 'C';
sub_lookup[3][2] = 'T';
int L,k;
int j=0;
const int buffersize=10000;
int currentsize=buffersize;
int datact=0;
int currchar;
vector <unsigned int> ctarr(buffersize);
vector <string> seqarr(buffersize);
filename=argv[1];
L=atoi(argv[2]);
k=atoi(argv[3]);
unsigned int sct;
int substrlen;
string sequence,textct;
ifstream seqfile (filename.c_str());
//map <string,unsigned int*> location;
unordered_map <string,unsigned int*> location;
if (seqfile.is_open())
{
getline(seqfile,textct,'\n');
while (textct != "")
{
sct=atoi(textct.c_str());
substrlen=textct.length();
//cout << textct << endl;
sequence=textct.substr(substrlen-L,L);
//cout << sequence << endl;
//is there an associated guy?
if (location.find(sequence) != location.end()) //just asks whether this key has been assigned
{ //there's a value in the region
*location[sequence]+=sct;
}
else
{ //no value in region, make a footprint
ctarr[datact]=sct;
seqarr[datact]=sequence;
location[sequence]=&ctarr[datact]; //assign current key to point to data count
//assign k substitution "funnel" region to point to this count as well
for (j=0; j<L; j++)
{
funnelstring=sequence;
currchar = nuc2num(sequence[j]);
if (datact==10736 && j==13)
{
cout << "here" << endl;
cout << sequence << endl;
}
for (k=0; k<3; k++)
{
funnelstring[j]=sub_lookup[currchar][k];
// if (datact==10736 && j==13)
// {
// cout << funnelstring << endl;
// cout << location.max_size() << " | " << location.size() << endl;
// string asdf;
// asdf="AAAAAAAAAAAAAAAA";
// location[asdf]=&ctarr[datact]; //still segfaults
// }
if (location.find(funnelstring) == location.end()) // asks whether this key has been assigned
{ //this region is not assigned to another funnel
location[funnelstring]=&ctarr[datact]; //LINE THAT CAUSES SIGSEGV
}
}
}
datact++;
cout << datact << endl;
if (datact>=currentsize)
{
ctarr.resize(currentsize+buffersize);
seqarr.resize(currentsize+buffersize);
currentsize+=buffersize;
}
}
getline(seqfile,textct,'\n');
}
seqfile.close();
}
Explorations.
datact==10736
and j=13
, any
key added to (unordered_map) location results in a SIGSEGV.datact
==16-35
thousand).If need be, I can post the datafile that is read.
EDIT: Solved
Instead of unordered_map <string,unsigned int*> location
, replaced with unordered_map <string,unsigned int> location
(value_type is int instead of int*). Now value_type holds the index in the ctarr[]. Runs fine. Thanks!
Upvotes: 1
Views: 3608
Reputation: 16253
Pointers to vector
elements can be invalidated when you call vector::resize()
. This is because the entire data may have to be moved around in order to find a contiguous memory block that fits the new size. In other words: As soon as you call resize
, all your location
data suddenly becomes useless garbage.
Possible solutions:
location
store the index of the required element in ctarr
as its value rather than a pointer. (This will certainly not change the semantics of your program.)location
store the actual unsigned int
value instead of a pointer. Depending on your program logic and how you change and access this data, this may not be what you want.Also note that although the segfault occurs in hashtable_policy.h
, this bug has nothing to do with the implementation of unordered_map
(or vector
) - it is entirely your fault for not reading the reference for vector::resize()
;-) : http://www.cplusplus.com/reference/vector/vector/resize/ (section 'Iterator validity')
The other thing I noticed about your code is that you use operator[]
for accessing your vector
elements. This disables out-of-bounds checking. If I came across an error like yours in my code (hard to trace back because it occurs somewhere far from my erroneous code), my first course of action would be swapping operator[]
for vector::at()
(actually, I always start with at()
and only switch if I can prove beyond reasonable doubt that bounds checking is a performance bottleneck for this particular purpose). This would not have helped with your problem, but is often an invaluable help in spotting mistakes.
Upvotes: 6