Why is the Disruptor slower with a smaller ring buffer?

Question

Following the Disruptor Getting Started Guide, I built a minimal Disruptor with a single producer and a single consumer.

Producer

import com.lmax.disruptor.RingBuffer;

public class LongEventProducer
{
    private final RingBuffer<LongEvent> ringBuffer;

    public LongEventProducer(RingBuffer<LongEvent> ringBuffer)
    {
        this.ringBuffer = ringBuffer;
    }

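    public void onData()
    {
        // Claim the next slot; this call waits when the ring buffer is full
        long sequence = ringBuffer.next();
        try
        {
            // Fetch the pre-allocated event; this benchmark sets no value on it
            LongEvent event = ringBuffer.get(sequence);
        }
        finally
        {
            // Publish the slot so the consumer can see it
            ringBuffer.publish(sequence);
        }
    }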
}

Consumer (note that the consumer does nothing in onEvent)

import com.lmax.disruptor.EventHandler;

public class LongEventHandler implements EventHandler<LongEvent>
{
    public void onEvent(LongEvent event, long sequence, boolean endOfBatch)
    {}
}

My goal was to compare the performance of traversing a large ring buffer once against traversing a smaller ring many times. In each case the total number of operations (bufferSize × rotations) is the same. I found that the ops/sec rate dropped sharply as the ring buffer got smaller.

RingBuffer Size | Revolutions | Total Ops | Mops/sec
        1048576 |           1 |   1048576 | 50-60
           1024 |        1024 |   1048576 | 8-16
             64 |       16384 |   1048576 | 0.5-0.7
              8 |      131072 |   1048576 | 0.12-0.14

Question: what causes the significant drop in performance when the ring buffer size is reduced but the total number of iterations is fixed? The trend does not depend on the WaitStrategy or on Single vs. MultiProducer: throughput is lower overall, but the trend is the same.

Main (note SingleProducer and BusySpinWaitStrategy)

import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.ProducerType;

import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

public class LongEventMainJava{
        static double ONEMILLION = 1000000.0;
        static double ONEBILLION = 1000000000.0;

    public static void main(String[] args) throws Exception {
            // Executor that will be used to construct new threads for consumers
            Executor executor = Executors.newCachedThreadPool();    

            // TUNABLE PARAMS
            int ringBufferSize = 1048576; // 1024, 64, 8
            int rotations = 1; // 1024, 16384, 131072

            // Construct the Disruptor
            Disruptor<LongEvent> disruptor = new Disruptor<>(new LongEventFactory(), ringBufferSize, executor, ProducerType.SINGLE, new BusySpinWaitStrategy());

            // Connect the handler
            disruptor.handleEventsWith(new LongEventHandler());

            // Start the Disruptor, starts all threads running
            disruptor.start();

            // Get the ring buffer from the Disruptor to be used for publishing.
            RingBuffer<LongEvent> ringBuffer = disruptor.getRingBuffer();
            LongEventProducer producer = new LongEventProducer(ringBuffer);

            long start = System.nanoTime();
            long totalIterations = (long) rotations * ringBufferSize; // widen first so the product cannot overflow int
            for (long i = 0; i < totalIterations; i++) {
                producer.onData();
            }
            double duration = (System.nanoTime()-start)/ONEBILLION;
            System.out.println(String.format("Buffersize: %s, rotations: %s, total iterations = %s, duration: %.2f seconds, rate: %.2f Mops/s",
                    ringBufferSize, rotations, totalIterations, duration, totalIterations/(ONEMILLION * duration)));
        }
}

To run it you will also need the trivial Factory code

import com.lmax.disruptor.EventFactory;

public class LongEventFactory implements EventFactory<LongEvent>
{
    public LongEvent newInstance()
    {
        return new LongEvent();
    }
}

Running on a Core i5-2400, 12 GB RAM, Windows 7

Sample output

Buffersize: 1048576, rotations: 1, total iterations = 1048576, duration: 0.02 seconds, rate: 59.03 Mops/s

Buffersize: 64, rotations: 16384, total iterations = 1048576, duration: 2.01 seconds, rate: 0.52 Mops/s

java performance performance-testing disruptor-pattern lmax

user4076764, Jul 3 '17 at 20:06

2 Answers

Solution

The problem appears to be in this block of code in lmax\disruptor\SingleProducerSequencer

if (wrapPoint > cachedGatingSequence || cachedGatingSequence > nextValue)
{
    cursor.setVolatile(nextValue);  // StoreLoad fence

    long minSequence;
    while (wrapPoint > (minSequence = Util.getMinimumSequence(gatingSequences, nextValue)))
    {
        waitStrategy.signalAllWhenBlocking();
        LockSupport.parkNanos(1L); // TODO: Use waitStrategy to spin?
    }

    this.cachedValue = minSequence;
}

Specifically, the culprit is the call to LockSupport.parkNanos(1L), which can take up to 15 ms on Windows. It is invoked whenever the producer has filled the buffer and has to wait for the consumer.
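To check how long that call really takes on a given machine, a rough measurement along these lines may help. This is only a sketch: the class and method names (ParkNanosLatency, averageParkNanos) are made up for illustration, and the result depends on the OS timer resolution and JVM, so treat the number as indicative.

```java
import java.util.concurrent.locks.LockSupport;

final class ParkNanosLatency {
    // Average wall-clock duration, in nanoseconds, of one LockSupport.parkNanos(1L) call.
    static double averageParkNanos(int samples) {
        long start = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            LockSupport.parkNanos(1L); // the same call the sequencer makes while the ring is full
        }
        return (System.nanoTime() - start) / (double) samples;
    }

    public static void main(String[] args) {
        averageParkNanos(50); // short warm-up so the timed loop is not dominated by startup cost
        double avgMicros = averageParkNanos(200) / 1_000.0;
        System.out.printf("average parkNanos(1L): %.1f us%n", avgMicros);
    }
}
```

If the average comes out in the millisecond range, every stall on a full ring costs far more than the publish itself, which would be consistent with the numbers in the question.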

Second, when the buffer is small, false sharing on the RingBuffer is probably occurring. I suspect both effects are in play.

Finally, I was able to speed the code up by warming up the JIT with a million calls to onData() before benchmarking. That raised the best case to > 80 Mops/sec, but it did not remove the degradation as the buffer shrinks.
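A warm-up phase like that can be sketched with a small harness. The names here (WarmupBenchmark, measureMops) are hypothetical, and the workload in main is a stand-in counter rather than the Disruptor producer; in the real benchmark you would pass producer::onData as the operation.

```java
import java.util.concurrent.atomic.LongAdder;

final class WarmupBenchmark {
    // Runs op warmupCalls times untimed, then timedCalls times timed; returns millions of ops per second.
    static double measureMops(Runnable op, long warmupCalls, long timedCalls) {
        for (long i = 0; i < warmupCalls; i++) {
            op.run(); // give the JIT a chance to compile the hot path before timing starts
        }
        long start = System.nanoTime();
        for (long i = 0; i < timedCalls; i++) {
            op.run();
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start); // guard against a zero reading
        return timedCalls * 1_000.0 / elapsedNanos; // ops per nanosecond * 1000 = Mops/s
    }

    public static void main(String[] args) {
        LongAdder counter = new LongAdder(); // stand-in workload, not the Disruptor
        double mops = measureMops(counter::increment, 1_000_000, 1_048_576);
        System.out.printf("rate: %.2f Mops/s%n", mops);
    }
}
```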

user4076764, Jul 5 '17 at 21:10

user438154, Jul 3 '17 at 21:25 · Accepted Answer

When the producer(s) fill the ring buffer, they must wait until events have been consumed before they can continue.

When your buffer size exactly matches the number of items you will insert, the producer never has to wait. The buffer never fills up. All the producer does is, essentially, increment a counter (the index) and publish the data into the ring buffer at that index.

When your buffer is smaller, the producer still just increments the counter and publishes, but it does so faster than the consumer can consume. The producer therefore has to wait until items have been consumed and space in the ring buffer frees up.
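The "buffer is full" condition can be illustrated with a tiny single-threaded model. This is not the Disruptor's code: WrapPointModel and its methods are hypothetical names. The wrap-point comparison follows my reading of SingleProducerSequencer (wrap point = claimed sequence minus buffer size), with the consumer starting at sequence -1, and the consumer is assumed to be completely idle.

```java
final class WrapPointModel {
    // True when claiming nextSequence would overwrite a slot the consumer has not processed yet.
    static boolean producerMustWait(long nextSequence, int bufferSize, long consumerSequence) {
        long wrapPoint = nextSequence - bufferSize;
        return wrapPoint > consumerSequence;
    }

    // Number of slots the producer can claim before its first wait if the consumer never advances.
    static long claimsBeforeFirstWait(int bufferSize) {
        long consumerSequence = -1L; // assumption: nothing has been consumed yet
        long next = 0L;
        while (!producerMustWait(next, bufferSize, consumerSequence)) {
            next++;
        }
        return next;
    }

    public static void main(String[] args) {
        for (int size : new int[] {8, 64, 1024, 1048576}) {
            System.out.println("bufferSize " + size + ": claims before first wait = "
                    + claimsBeforeFirstWait(size));
        }
    }
}
```

In this model the producer gets exactly bufferSize claims for free. With 1048576 slots and 1048576 events it never reaches the wait branch, while with 8 slots it is at the mercy of the consumer after the first 8 events.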