Jakub M.

Reputation: 33847

malloc / new lock and multithreading

How should I use new in a multithread environment?

Precisely: I have a piece of code that I run with 40 threads. Each thread invokes new a few times. I noticed that performance drops, probably because threads lock in new (significant time is spent in __lll_lock_wait_parallel and __lll_unlock_wait_parallel). What is the best alternative to new / delete I can use?

Upvotes: 2

Views: 3635

Answers (6)

sam

Reputation: 1401

I think you should use a memory pool: allocate all the memory you need (if the size is fixed) once, when the program starts, and have the threads take the blocks they need from that initial allocation instead of calling new.

Upvotes: 2

Since nobody mentioned it, I might also suggest trying Boehm's conservative garbage collector: use new(gc) instead of new and GC_malloc instead of malloc, and don't bother freeing or deleting memory objects. A couple of years ago I measured GC_malloc against malloc; it was a bit slower (perhaps 25µs for GC_malloc versus 22µs for system malloc).

I have no idea of the performance of Boehm's GC in multi-threaded usage (but I do know it can be used in multi-threaded applications).

Boehm's GC has the advantage that you should not care about free-ing your data.

Upvotes: 0

Malkocoglu

Reputation: 2611

1st, do you really have to "new" that thing? Why not use a local variable or a per-thread heap object?

2nd, have a look at http://en.wikipedia.org/wiki/Thread-local_storage if your development environment supports it...

Upvotes: 1

Martin James

Reputation: 24877

I tend to use object pools in servers and other apps characterized by continual, frequent allocation and release of large numbers of a few sets of objects (in servers: socket, buffer and buffer-collection classes). The pools are queues, created at startup with an appropriate number of instances pushed on (e.g. my server: 24000 sockets, 48000 collections and an array of 7 pools of buffers of varying size/count). Popping an object instance off a queue and pushing it back on is much quicker than new/delete, even if the pool queue has a lock because it is shared across the threads (the smaller the lock span, the smaller the chance of contention). My pooled-object class (from which all the sockets etc. are inherited) has a private 'myPool' member (loaded at startup) and a 'release()' method with no parameters, so any buffer is easily and correctly returned to its own pool. There are issues:

1) Ctor and dtor are not called on allocate/release, so allocated objects contain all the gunge left over from their last use. This can occasionally be useful (e.g. re-usable socket objects), but generally means that care needs to be taken over, say, the initial state of booleans, values of ints, etc.

2) A pool per thread has the greatest performance-improvement potential - no locking required - but in systems where the loading on each thread is intermittent, this can waste objects. I never seem to be able to get away with it, mainly because I use pooled objects for inter-thread comms and so release() has to be thread-safe anyway.

3) Elimination of 'false sharing' on shared pools can be awkward - each instance should initially be 'newed' so that it exclusively occupies an integer number of cache lines. At least this only has to be done once at startup.

4) If the system is to be resilient upon a pool running out, either more objects need to be allocated to add to the pool when needed, (the pool size is then creeping up), or a producer-consumer queue can be used so that threads block on the pool until objects are released, (P-C queues are slower because of the condvar/semaphore/whatever for waiting threads to block on, also threads that allocate before releasing can deadlock on an empty pool).

5) Monitoring of the pool levels during development is required so that object leakages and double-releases can be detected. Code/data can be added to the objects/pools to detect such errors as they happen but this compromises performance.

Upvotes: 1

Will

Reputation: 75665

Even if you are using the new operator, it's using malloc underneath to do the allocation and deallocation. The focus should be on the allocator, not the API used to reach it, in these circumstances.

TCMalloc is a malloc created at Google specifically for good performance in a multi-threading environment. It is part of google-perf-tools.

Another malloc you might look at is Hoard. It has much the same aims as TCMalloc.

Upvotes: 6

cnicutar

Reputation: 182664

I don't know about "the best", but I would try a few things:

  • Reduce the frequency of allocations / frees (might be hard). Just waste memory (but don't leak) if it improves performance

  • Roll my own per-thread allocator and always alloc/free from the same thread, using mmap for the real memory

To roll your own primitive allocator:

  • Use mmap to obtain a large chunk of memory from the OS
  • Use a data structure (linked list, tree etc) to keep track of free and used blocks
  • Never free data allocated by another thread

I don't consider this trivial to do but if done right it could improve performance. The hairiest part is by far keeping track of the allocations, preventing fragmentation etc.

A simple implementation is provided in "The C Programming Language", near the end of the book (but it uses sbrk rather than mmap).

Upvotes: 5
