Reputation: 60
I'm trying to use the Disruptor to process messages. I need two phases of processing, i.e. two groups of handlers working as worker pools, like this (I think):

disruptor
    .handleEventsWithWorkerPool(firstPhaseHandlers)
    .thenHandleEventsWithWorkerPool(secondPhaseHandlers);

With the code above, if I put more than one worker in each group, performance deteriorates badly: far more CPU is burned for the exact same amount of work.
I tried tweaking the ring buffer size (which I have already seen affect performance), but in this case it didn't help. So am I doing something wrong, or is this a real problem?
I'm attaching a full demo of the issue.
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.WorkHandler;
import com.lmax.disruptor.dsl.Disruptor;

final class ValueEvent {
    private long value;

    public long getValue() {
        return value;
    }

    public void setValue(long value) {
        this.value = value;
    }

    public static final EventFactory<ValueEvent> EVENT_FACTORY = new EventFactory<ValueEvent>() {
        public ValueEvent newInstance() {
            return new ValueEvent();
        }
    };
}

// First-phase handler: just counts each event it processes.
class MyWorkHandler implements WorkHandler<ValueEvent> {
    AtomicLong workDone;

    public MyWorkHandler(AtomicLong wd) {
        this.workDone = wd;
    }

    public void onEvent(final ValueEvent event) throws Exception {
        workDone.incrementAndGet();
    }
}

// Second-phase handler: identical work, runs after the first pool.
class My2ndPhaseWorkHandler implements WorkHandler<ValueEvent> {
    AtomicLong workDone;

    public My2ndPhaseWorkHandler(AtomicLong wd) {
        this.workDone = wd;
    }

    public void onEvent(final ValueEvent event) throws Exception {
        workDone.incrementAndGet();
    }
}

class MyEventTranslator implements EventTranslatorOneArg<ValueEvent, Long> {
    @Override
    public void translateTo(ValueEvent event, long sequence, Long value) {
        event.setValue(value);
    }
}

public class TwoPhaseDisruptor {
    static AtomicLong workDone = new AtomicLong(0);

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        ExecutorService exec = Executors.newCachedThreadPool();
        int numOfHandlersInEachGroup = Integer.parseInt(args[0]);
        long eventCount = Long.parseLong(args[1]);
        int ringBufferSize = 2 << (Integer.parseInt(args[2])); // 2^(args[2]+1); must be a power of two

        Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(
                ValueEvent.EVENT_FACTORY, ringBufferSize, exec);

        ArrayList<MyWorkHandler> handlers = new ArrayList<MyWorkHandler>();
        for (int i = 0; i < numOfHandlersInEachGroup; i++) {
            handlers.add(new MyWorkHandler(workDone));
        }

        ArrayList<My2ndPhaseWorkHandler> phase2_handlers = new ArrayList<My2ndPhaseWorkHandler>();
        for (int i = 0; i < numOfHandlersInEachGroup; i++) {
            phase2_handlers.add(new My2ndPhaseWorkHandler(workDone));
        }

        // Phase-1 worker pool, then phase-2 worker pool once phase 1 has seen the event.
        disruptor
                .handleEventsWithWorkerPool(
                        handlers.toArray(new WorkHandler[handlers.size()]))
                .thenHandleEventsWithWorkerPool(
                        phase2_handlers.toArray(new WorkHandler[phase2_handlers.size()]));

        long s = System.currentTimeMillis();
        disruptor.start();

        MyEventTranslator myEventTranslator = new MyEventTranslator();
        for (long i = 0; i < eventCount; i++) {
            disruptor.publishEvent(myEventTranslator, i);
        }

        disruptor.shutdown(); // waits until all published events have been processed
        exec.shutdown();

        System.out.println("time spent " + (System.currentTimeMillis() - s) + " ms");
        System.out.println("amount of work done " + workDone.get());
    }
}
Try running the example above with one thread in each group:
1 100000 7
On my computer it gave:
time spent 371 ms
amount of work done 200000
Then try it with four threads in each group:
4 100000 7
which on my computer gave:
time spent 9853 ms
amount of work done 200000
During that run the CPU sits at 100% utilization.
Upvotes: 0
Views: 1612
Reputation: 969
You appear to be false sharing the AtomicLong between threads/cores: every worker increments the same counter, so the cache line it lives on ping-pongs between cores. I'll try it out with a demo when I have more time, but a much better approach would be to give each WorkHandler a private variable that its thread owns exclusively (either its own AtomicLong or, preferably, a plain long).
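As a minimal sketch of that change (the class and method names here are illustrative, not from the code above): each handler increments its own plain long, and the main thread reads the totals only after disruptor.shutdown() has returned, when the workers are no longer writing.

class CountingWorkHandler implements WorkHandler<ValueEvent> {
    // Owned by exactly one worker thread, so the cache line holding it
    // never bounces between cores during processing.
    private long workDone;

    @Override
    public void onEvent(ValueEvent event) throws Exception {
        workDone++;
    }

    // Call only after disruptor.shutdown() has returned.
    public long getWorkDone() {
        return workDone;
    }
}

After shutdown, summing getWorkDone() over all handlers reproduces the "amount of work done" figure.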
Update:
If you change your Disruptor line to:
Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(
        ValueEvent.EVENT_FACTORY, ringBufferSize,
        exec,
        com.lmax.disruptor.dsl.ProducerType.SINGLE,
        new com.lmax.disruptor.BusySpinWaitStrategy());
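(ProducerType.SINGLE lets the Disruptor skip the multi-producer coordination, since only the main thread publishes here, and BusySpinWaitStrategy keeps waiting consumers spinning on the CPU instead of parking, trading CPU time for lower latency.)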
You'll get much better results:
jason@debian01:~/code/stackoverflow$ java -cp disruptor-3.1.1.jar:. TwoPhaseDisruptor 4 100000 1024
time spent 2728 ms
amount of work done 200000
I reviewed the code and tried to fix the false sharing, but found little improvement. That's when I noticed on my 8-core box that the CPUs were nowhere near 100% utilization (even for the four-worker test). From this I concluded that, at the very least, a yielding/spinning wait strategy reduces latency when you have CPU to burn.
Just make sure you have enough cores: with four workers in each group you need eight for processing, plus one for publishing the messages.
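If you can't dedicate a core to every thread, a less CPU-hungry variant to try (my suggestion, not benchmarked here) is YieldingWaitStrategy, which spins briefly and then calls Thread.yield() instead of burning the core outright:

Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(
        ValueEvent.EVENT_FACTORY, ringBufferSize,
        exec,
        com.lmax.disruptor.dsl.ProducerType.SINGLE,
        new com.lmax.disruptor.YieldingWaitStrategy());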
Upvotes: 2