Segmentation fault appearing when program parameters are over a certain threshold

Question

I'm writing a fairly large network simulator in C++. I tested individual pieces regularly as I developed them, and now that everything is put together it seems to work as long as the load I impose on the simulator is not too high (it's a P2P content distribution simulator, so the more distinct "contents" I introduce, the more data transfers the simulator has to handle). Anything above a certain number of simulated contents results in an abrupt SIGSEGV after several minutes of smooth running. I assumed there was a memory leak that eventually grew too large and broke things, but a valgrind run with the parameters below the threshold finished without errors. However, if I run the program under valgrind with a content count above the threshold, after a certain point I start to get memory access errors in functions that previously showed no problems:

==5987== Invalid read of size 8
==5987==    at 0x40524E: Scheduler::advanceClock() (Scheduler.cpp:38)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==  Address 0x2e63bc70 is 0 bytes inside a block of size 32 free'd
==5987==    at 0x4C2A4BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5987==    by 0x405487: Scheduler::advanceClock() (Scheduler.cpp:69)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==
==5987== Invalid read of size 4
==5987==    at 0x40584E: Request::getSimTime() const (Event.hpp:45)
==5987==    by 0x40525C: Scheduler::advanceClock() (Scheduler.cpp:38)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==  Address 0x2e63bc78 is 8 bytes inside a block of size 32 free'd
==5987==    at 0x4C2A4BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5987==    by 0x405487: Scheduler::advanceClock() (Scheduler.cpp:69)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==

I know it might be hard to give an answer without seeing the whole code, but is there a "high-level" hint on what might be going on here? I don't understand why a function that seems to work normally suddenly starts misbehaving. Is there something obvious that I'm missing maybe?

The offending line in the valgrind log above (Scheduler.cpp:38) is if (nextEvent->getSimTime() < this->getSimTime()) in the following block:

bool Scheduler::advanceClock() {
  if (pendingEvents.size() == 0) {
    std::cerr << "WARNING: Scheduler::advanceClock() - Empty event queue before "
        "reaching the termination event" << std::endl;
    return false;
  }
  const Event* nextEvent = pendingEvents.top();
  // Check that the event is not scheduled in the past
  if (nextEvent->getSimTime() < this->getSimTime()) {
    std::cerr << "Scheduler::advanceClock() - Event scheduled in the past!"
        << std::endl;
    std::cerr << "Simulation time: " << this->getSimTime()
        << ", event time: " << nextEvent->getSimTime()
        << std::endl;
    exit(ERR_EVENT_IN_THE_PAST);
  }
  // Update the clock with the current event time (>= previous time)
  this->setSimTime(nextEvent->getSimTime());
  ...

where pendingEvents is a boost::heap::binomial_heap.

manuhalo · Accepted Answer

I finally found what the problem was. When an event completed and needed to be removed from the queue, my code went something like this:

...
// Data transfer completed, remove event from queue
// Notify the oracle, which will update the cache mapping and free resources
// in the topology
oracle->notifyCompletedFlow(nextEvent, this);
// Remove flow from top of the queue
pendingEvents.pop();
handleMap.erase(nextEvent);
delete nextEvent;
return true;

The problem was that oracle->notifyCompletedFlow() invoked methods on the scheduler that dynamically update the priority of scheduled events (e.g. to react to a change in the available bandwidth in the network). As a result, by the time I removed the top of the queue with pendingEvents.pop(), in some cases I was popping a different event and leaving nextEvent in the queue, even though it was deleted right afterwards. A later call to advanceClock() then read that freed event from the top of the queue, which matches the invalid reads valgrind reported. Popping the queue before invoking the oracle fixed the problem.
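To make the ordering issue concrete, here is a minimal sketch of it, not the simulator's actual code. It assumes a simplified Event and Scheduler, uses a std::set in place of the boost heap so that it needs no external library, and models the oracle as a hypothetical notifyCompletedFlow() that simply moves another pending event to an earlier time. All of these names and signatures are illustrative assumptions. Nothing is deleted in the sketch, so it only shows which event ends up being removed.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the simulator's Event class.
struct Event {
  int id;
  double simTime;
};

// Earliest event first; ties broken by id so the ordering is deterministic.
struct EarlierFirst {
  bool operator()(const Event* a, const Event* b) const {
    if (a->simTime != b->simTime) return a->simTime < b->simTime;
    return a->id < b->id;
  }
};

// Simplified scheduler. A std::set replaces the boost heap here: it keeps
// events sorted and supports re-keying via erase + insert.
class Scheduler {
 public:
  void schedule(Event* e) { pending.insert(e); }
  Event* top() const {
    assert(!pending.empty());
    return *pending.begin();
  }
  void pop() {
    assert(!pending.empty());
    pending.erase(pending.begin());
  }
  // Change an event's time while keeping the queue ordered.
  // The event must be erased before its key (simTime) is modified.
  void reschedule(Event* e, double newTime) {
    pending.erase(e);
    e->simTime = newTime;
    pending.insert(e);
  }
  bool contains(Event* e) const { return pending.count(e) > 0; }

 private:
  std::set<Event*, EarlierFirst> pending;
};

// Hypothetical oracle callback: completing a flow changes the schedule of
// another pending event, possibly moving it ahead of the current top.
void notifyCompletedFlow(Scheduler& s, Event* other, double newTime) {
  s.reschedule(other, newTime);
}

// Order from the original code: notify, then pop. Returns the event that was
// actually removed from the queue, which may no longer be nextEvent.
Event* completeNotifyThenPop(Scheduler& s, Event* other, double newTime) {
  Event* nextEvent = s.top();
  notifyCompletedFlow(s, other, newTime);  // may reorder the queue
  (void)nextEvent;  // the original code goes on to delete this pointer
  Event* removed = s.top();
  s.pop();
  return removed;
}

// Fixed order: pop first, then notify. The removed event is always nextEvent.
Event* completePopThenNotify(Scheduler& s, Event* other, double newTime) {
  Event* nextEvent = s.top();
  s.pop();
  notifyCompletedFlow(s, other, newTime);
  return nextEvent;
}
```

With two events scheduled at times 1.0 and 5.0, calling completeNotifyThenPop() with the second event rescheduled to 0.5 removes the second event and leaves the first one in the queue, which is the pointer the original code would then delete. completePopThenNotify() with the same inputs removes the first event, as intended.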

I apologize for having left out pieces of code that might have led to a quicker answer; I'll try to learn from my mistake :) Thanks for pointing me in the right direction.
