Тайм-аут Кассандры во время запроса на запись в последовательности QUORUM

Я использую приложение Lagom 1.4 с бэкэндом Cassandra 3.11.2, развернутым в AWS через Kubernetes. Есть две копии Кассандры. Все работает на экземплярах i3xlarge с томами 4G EBS, подключенными для хранения Cassandra к каждой реплике (в режиме ReadWriteOnce).

Я получаю случайные исключения для "Тайм-аут Кассандры во время запроса записи при согласовании QUORUM":

2018-04-04T18:42:19.642Z [error] akka.stream.scaladsl.RestartWithBackoffSource [sourceThread=application-akka.actor.default-dispatcher-3, akkaSource=RestartWithBackoffSource(akka://application), sourceActorSystem=application, akkaTimestamp=18:42:19.637UTC] - Restarting graph due to failure
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:503)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:462)
    at akka.persistence.cassandra.package$ListenableFutureConverter$$anon$2$$anonfun$run$2.apply(package.scala:25)
    at scala.util.Try$.apply(Try.scala:191)
    at akka.persistence.cassandra.package$ListenableFutureConverter$$anon$2.run(package.scala:25)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:134)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:507)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1075)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:998)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:146)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:60)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:38)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:289)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:269)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    ... 22 common frames omitted

У меня есть щедрая стратегия тайм-аута записи и периодической фиксации журнала:

      - name: WRITE_REQUEST_TIMEOUT_IN_MS
        value: "20000"
      - name: COMMITLOG_SYNC
        value: "periodic"
      - name: COMMITLOG_SYNC_PERIOD_IN_MS
        value: "10000"

Кассандра в целом чувствует себя вялой. В журналах Кассандры я вижу много таких вещей:

INFO  18:38:49 G1 Young Generation GC in 405ms.  G1 Eden Space: 120586240 -> 0; G1 Old Gen: 393197456 -> 394246032; G1 Survivor Space: 5242880 -> 11534336; 
INFO  18:38:51 G1 Young Generation GC in 301ms.  G1 Eden Space: 41943040 -> 0; G1 Old Gen: 394246048 -> 405698464; G1 Survivor Space: 11534336 -> 4194304; 
INFO  18:39:06 G1 Young Generation GC in 292ms.  G1 Eden Space: 46137344 -> 0; G1 Old Gen: 307863800 -> 282697960; G1 Survivor Space: 7340032 -> 6291456; 
INFO  18:39:08 G1 Young Generation GC in 304ms.  G1 Eden Space: 47185920 -> 0; G1 Old Gen: 282697960 -> 259629304; G1 Survivor Space: 6291456 -> 7340032; 
INFO  18:39:10 G1 Young Generation GC in 204ms.  G1 Eden Space: 46137344 -> 0; G1 Old Gen: 260677912 -> 259105040; G1 Survivor Space: 7340032 -> 6291456; 
INFO  18:39:35 G1 Young Generation GC in 203ms.  G1 Eden Space: 156237824 -> 0; G1 Old Gen: 299370920 -> 300419496; G1 Survivor Space: 9437184 -> 14680064; 
INFO  18:39:43 G1 Young Generation GC in 301ms.  G1 Eden Space: 117440512 -> 0; G1 Old Gen: 306153944 -> 317825504; G1 Survivor Space: 15728640 -> 9437184; 
INFO  18:39:57 G1 Young Generation GC in 290ms.  G1 Eden Space: 158334976 -> 0; G1 Old Gen: 325588488 -> 330408464; G1 Survivor Space: 13631488 -> 14680064; 
INFO  18:40:00 G1 Young Generation GC in 208ms.  G1 Eden Space: 101711872 -> 0; G1 Old Gen: 331457072 -> 341942832; G1 Survivor Space: 14680064 -> 9437184; 
INFO  18:40:10 G1 Young Generation GC in 211ms.  G1 Eden Space: 328204288 -> 0; G1 Old Gen: 345088624 -> 346137200; G1 Survivor Space: 9437184 -> 19922944; 
WARN  18:40:14 Out of 31 commit log syncs over the past 296.80s with average duration of 336.74ms, 31 have exceeded the configured commit interval by an average of 111.32ms
INFO  18:40:14 G1 Young Generation GC in 208ms.  G1 Eden Space: 135266304 -> 0; G1 Old Gen: 346137200 -> 368059496; G1 Survivor Space: 19922944 -> 7340032; 
INFO  18:40:31 G1 Young Generation GC in 493ms.  G1 Eden Space: 208666624 -> 0; G1 Old Gen: 367535232 -> 374973064; G1 Survivor Space: 16777216 -> 12582912; 
INFO  18:40:57 G1 Young Generation GC in 300ms.  G1 Eden Space: 68157440 -> 0; G1 Old Gen: 389128912 -> 402026704; G1 Survivor Space: 15728640 -> 5242880; 
INFO  18:41:08 G1 Young Generation GC in 203ms.  G1 Eden Space: 39845888 -> 0; G1 Old Gen: 352218840 -> 308988128; G1 Survivor Space: 13631488 -> 4194304; 
INFO  18:41:10 G1 Young Generation GC in 204ms.  G1 Eden Space: 49283072 -> 0; G1 Old Gen: 308988128 -> 277979360; G1 Survivor Space: 4194304 -> 5242880; 
INFO  18:41:12 G1 Young Generation GC in 304ms.  G1 Eden Space: 48234496 -> 0; G1 Old Gen: 279027968 -> 249667840; G1 Survivor Space: 5242880 -> 7340032; 
INFO  18:41:14 G1 Young Generation GC in 298ms.  G1 Eden Space: 46137344 -> 0; G1 Old Gen: 249667840 -> 226074872; G1 Survivor Space: 7340032 -> 5242880; 
INFO  18:41:16 G1 Young Generation GC in 202ms.  G1 Eden Space: 48234496 -> 0; G1 Old Gen: 226074872 -> 224502016; 
INFO  18:41:43 G1 Young Generation GC in 205ms.  G1 Eden Space: 225443840 -> 0; G1 Old Gen: 258939328 -> 260512184; G1 Survivor Space: 10485760 -> 15728640; 
INFO  18:42:07 G1 Young Generation GC in 202ms.  G1 Eden Space: 222298112 -> 0; G1 Old Gen: 289514048 -> 291086904; G1 Survivor Space: 13631488 -> 17825792; 
INFO  18:42:17 G1 Young Generation GC in 296ms.  G1 Eden Space: 177209344 -> 0; G1 Old Gen: 292135512 -> 303354464; G1 Survivor Space: 23068672 -> 16777216; 
INFO  18:42:20 G1 Young Generation GC in 205ms.  G1 Eden Space: 106954752 -> 0; G1 Old Gen: 304403072 -> 316252808; G1 Survivor Space: 16777216 -> 9437184; 
INFO  18:42:31 G1 Young Generation GC in 393ms.  G1 Eden Space: 196083712 -> 0; G1 Old Gen: 316777608 -> 317301904; G1 Survivor Space: 13631488 -> 17825792; 
INFO  18:42:33 G1 Young Generation GC in 206ms.  G1 Eden Space: 35651584 -> 0; G1 Old Gen: 317301904 -> 333834384; G1 Survivor Space: 17825792 -> 3145728; 

И даже это:

WARN  18:40:53 Out of 0 commit log syncs over the past 0.00s with average duration of Infinityms, 1 have exceeded the configured commit interval by an average of 171.42ms

Хуже всего то, что возможная последовательность нарушена. Теперь в реестре есть объекты, которые не были переведены на сторону чтения даже процессорами. Это должно быть там, где произошли тайм-ауты.

Почему сборка мусора происходит так часто, так долго? Как вы думаете, это основное влияние на производительность? Что еще я могу пропустить?

0 ответов

Другие вопросы по тегам