jaden

Reputation: 55

Threads, Coro, Anyevent confusion

I am relatively new to Perl and even newer to threading in Perl. I have a Perl script that takes input from three different sources: two LDAP queries and a file that isn't always there. Because some parts can take longer than others, I decided to use threads and queues. During development, testing individual components of the script worked very well, but after putting it all together the performance seems to degrade.

The basic structure is: 2 threads (read file or read AD entries) -> Queue1 -> 2 threads (scrub data) -> Queue2 -> 3-4 threads (compare against existing local LDAP entries). Several threads report statistics back to the main script, and once all threads are done an email is sent with all the stats and the status of that run.

I am using dequeue_nb, and I thought that would help, but no luck.
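For context, here is a minimal sketch of the kind of pipeline described (stage names and data are placeholders, not your real code), using threads and Thread::Queue. Note that a blocking dequeue() plus an end-of-stream marker is usually kinder to the CPU than polling with dequeue_nb() in a loop:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q1 = Thread::Queue->new;
my $q2 = Thread::Queue->new;

# Two "scrub" workers reading from Queue1, writing to Queue2
my @scrubbers = map {
    threads->create(sub {
        while (defined(my $item = $q1->dequeue)) {   # blocks; no busy-wait
            $q2->enqueue(uc $item);                  # stand-in for "scrub"
        }
    });
} 1 .. 2;

$q1->enqueue($_) for qw(a b c);       # stand-in for the input stage
$q1->enqueue(undef) for 1 .. 2;       # one end-of-stream marker per scrubber

$_->join for @scrubbers;

# Drain Queue2 in the main thread; dequeue_nb is fine once producers are done
my @out;
while (defined(my $x = $q2->dequeue_nb)) { push @out, $x }
print "@{[ sort @out ]}\n";           # A B C
```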

The performance hit seems to be in the queues. While looking for tips to improve performance I've run into several articles saying Perl threads are no good and to use Coro, POE, AnyEvent, IO::Async, etc.

This doesn't seem like an "event" problem, so I didn't think AnyEvent or POE would be the way to go, and from what I'm seeing Coro only uses one CPU at a time, so I'm not sure that would work either. I thought about using a combination of them, but then my head started to hurt. Does anyone have any suggestions on how to fix/troubleshoot my script, or on how to implement one of the other modules?

Upvotes: 4

Views: 1382

Answers (2)

amon

Reputation: 57640

A problem with parallelism is synchronization. It is a performance killer, it is bad, it is to be avoided if possible.

OP's architecture

Let's look at your architecture:

+--------------+--------------+
|   Input 1    |   Input 2    |
+--------------+--------------+
|           QUEUE A           |
+--------------+--------------+
|   Scrub 1    |   Scrub 2    |
+--------------+--------------+
|           QUEUE B           |
+---------+---------+---------+
| Compare | Compare | Compare |
+---------+---------+---------+

Discussion

Queue A has to be synchronized across four threads, Queue B across five or six. Only one thread can access a queue at any time, so most of the time your threads will be waiting, not working!

Parallel Pipeline Architecture

A somewhat different architecture could look like this:

+-----------+    +-----------+
|  Input 1  |    |  Input 2  |
+-----------+    +-----------+
| QUEUE  1A |    | QUEUE  2A |
+-----------+    +-----------+
|  Scrub 1  |    |  Scrub 2  |
+-----------+    +-----------+
| QUEUE  1B |    | QUEUE  2B |
+-----+-----+    +-----+-----+
| Cmp | Cmp |    | Cmp | Cmp |
+-----+-----+    +-----+-----+

Discussion

Here, the A queues are only shared between two threads (-> less waiting), the B queues only between three. This architecture should perform faster for inputs of similar size/complexity. If Input 2 were considerably shorter, the whole Pipeline 2 would have run before Pipeline 1 is even half finished. It is, however, far better than using a single process for each pipeline.
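A sketch of this per-pipeline variant: each input source gets its own pair of queues, so the A queue is only ever touched by two threads. The stage bodies here are illustrative placeholders, not the OP's actual LDAP code:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Spawn one complete input -> scrub pipeline with its own private queues
sub spawn_pipeline {
    my @items = @_;
    my $qa = Thread::Queue->new;
    my $qb = Thread::Queue->new;

    my $input = threads->create(sub {
        $qa->enqueue($_) for @items;
        $qa->enqueue(undef);              # end-of-stream for the scrubber
    });
    my $scrub = threads->create(sub {
        while (defined(my $x = $qa->dequeue)) {
            $qb->enqueue("scrubbed:$x");  # stand-in for real scrubbing
        }
        $qb->enqueue(undef);              # ... and for the consumer
    });
    return ($qb, $input, $scrub);
}

my ($qb1, @t1) = spawn_pipeline('ldap-entry-1', 'ldap-entry-2');
my ($qb2, @t2) = spawn_pipeline('file-line-1');

# One "compare" consumer per pipeline, kept in the main thread for brevity
my @results;
for my $qb ($qb1, $qb2) {
    while (defined(my $x = $qb->dequeue)) { push @results, $x }
}
$_->join for @t1, @t2;
print scalar(@results), " items compared\n";   # 3 items compared
```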

The Lawn Sprinkler Architecture

Concept

An even better architecture would try to distribute the output of each stage across multiple queues. (The reverse, having threads fetch their input from multiple queues, is bad when a queue is empty.)

Successive writes from each thread should go to a different queue:

  +-----------+   +-----------+
  |  Input 1  |   |  Input 2  |
  +-----------+   +-----------+
        |      \ /      |
  +-----------+   +-----------+
  | QUEUE  1A |   | QUEUE  2A |
  +-----------+   +-----------+
  |  Scrub 1  |   |  Scrub 2  |
  +-----------+   +-----------+
        / | \ \   / / | \      

+-------+-------+-------+-------+
| Q. 1B | Q. 2B | Q. 3B | Q. 4B |
+-------+-------+-------+-------+
|  Cmp  |  Cmp  |  Cmp  |  Cmp  |
+-------+-------+-------+-------+

This makes sure each thread has the same workload, but it cannot make sure that all threads finish at the same time.

Discussion

All queues are shared among three threads. The problem is that two threads will block each other when writing to the same queue. If the time between queue writes is significantly larger than the write duration itself, this should be no problem; otherwise the second architecture can be mixed in.

So whether this architecture makes sense depends on your exact requirements.

It is slower for evenly sized inputs, but performs better on irregular input.
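The rotating "sprinkler" write can be sketched as below. The index here is per-writer; with several writer threads, each should either start at a different offset or guard a shared index with a lock (names and counts are illustrative):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Four downstream B queues, one per "compare" worker
my @out_queues = map { Thread::Queue->new } 1 .. 4;
my $next = 0;   # per-writer rotation index

# Spread successive writes across the queues round-robin
sub sprinkle {
    my ($item) = @_;
    $out_queues[$next]->enqueue($item);
    $next = ($next + 1) % @out_queues;   # rotate to the next queue
}

sprinkle("item$_") for 1 .. 10;
print join(',', map { $_->pending } @out_queues), "\n";   # 3,3,2,2
```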

Appendices

On implementing:

Which framework is used is secondary to the architecture. If you only pass around text strings, I strongly advise using pipes. If you have to pass Perl data types or objects, you probably have to embrace the additional overhead of using a real queue: when adding an unshared variable to a queue, a deep copy has to be made (see @Leon Timmermans' answer) in addition to all the synchronization overhead.
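To illustrate the pipe suggestion: if stages only exchange lines of text, a plain pipe between forked processes avoids queue locking and deep copies entirely. A minimal sketch (Unix-style fork; the "record" lines are placeholders):

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";

defined(my $pid = fork()) or die "fork: $!";
if ($pid == 0) {                     # child: the "producer" stage
    close $reader;
    print {$writer} "record $_\n" for 1 .. 3;
    close $writer;
    exit 0;
}

close $writer;                       # parent: the "consumer" stage
my @lines = <$reader>;               # reads until the child closes its end
close $reader;
waitpid $pid, 0;
print scalar(@lines), " lines received\n";   # 3 lines received
```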

On scalability:

Architectures 1 and 3 are not fixed in the number of threads. I strongly suggest using this flexibility to benchmark different compositions. A rule of thumb is to use n to 2n threads, where n is the number of processors (or hardware threads); this can be seen as a sensible maximum for the threads of one stage. Above that you only pay a memory penalty with no speedup. The saturation point may be reached earlier, when a stage can process the input faster than it is supplied.
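Such benchmarking is easy if the stage width is a parameter. A sketch (run_stage and the dummy workload are hypothetical, not part of the OP's script):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw(time);

# Run one stage with a configurable number of worker threads;
# returns the number of items processed.
sub run_stage {
    my ($n_workers, $work, @items) = @_;
    my $in  = Thread::Queue->new(@items, (undef) x $n_workers);
    my $out = Thread::Queue->new;
    my @workers = map {
        threads->create(sub {
            while (defined(my $x = $in->dequeue)) {
                $out->enqueue($work->($x));
            }
        });
    } 1 .. $n_workers;
    $_->join for @workers;
    return $out->pending;
}

# Time the same dummy workload at different stage widths
for my $n (1, 2, 4) {
    my $t0 = time;
    run_stage($n, sub { my $s = 0; $s += sqrt($_) for 1 .. 50_000; $s }, 1 .. 20);
    printf "%d workers: %.3fs\n", $n, time() - $t0;
}
```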

Upvotes: 5

Leon Timmermans

Reputation: 30235

What kind of data are you putting in the queues? AFAIK simple data is cheaper than complex structures, since it needs to be cloned and copied at least twice. I've been planning to write a faster queue implementation (most of the work is already done, actually), but haven't published it yet.
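One way to act on this, sketched below: flatten a structure to a single string (JSON here, via the core JSON::PP module) before enqueueing, so the queue clones one scalar instead of deep-copying a nested structure. The record fields are made up for illustration:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use JSON::PP qw(encode_json decode_json);

my $q = Thread::Queue->new;
my $record = { cn => 'jdoe', mail => ['jdoe@example.com'], uid => 42 };

$q->enqueue(encode_json($record));    # one flat string crosses the queue
my $copy = decode_json($q->dequeue);  # rebuild the structure on the other side
print $copy->{cn}, "\n";              # jdoe
```

Whether the encode/decode cost beats the deep-copy cost depends on the structure, so it is worth benchmarking with your real LDAP entries.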

Upvotes: 0
