Тайм-аут Кассандры во время запроса на запись в последовательности QUORUM
Я использую приложение Lagom 1.4 с бэкэндом Cassandra 3.11.2, развернутым в AWS через Kubernetes. Есть две копии Кассандры. Все работает на экземплярах i3xlarge с томами 4G EBS, подключенными для хранения Cassandra к каждой реплике (в режиме ReadWriteOnce).
Я получаю случайные исключения для "Тайм-аут Кассандры во время запроса записи при согласовании QUORUM":
2018-04-04T18:42:19.642Z [error] akka.stream.scaladsl.RestartWithBackoffSource [sourceThread=application-akka.actor.default-dispatcher-3, akkaSource=RestartWithBackoffSource(akka://application), sourceActorSystem=application, akkaTimestamp=18:42:19.637UTC] - Restarting graph due to failure
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:503)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:462)
at akka.persistence.cassandra.package$ListenableFutureConverter$$anon$2$$anonfun$run$2.apply(package.scala:25)
at scala.util.Try$.apply(Try.scala:191)
at akka.persistence.cassandra.package$ListenableFutureConverter$$anon$2.run(package.scala:25)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
at com.datastax.driver.core.Responses$Error.asException(Responses.java:134)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:507)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1075)
at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:998)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:146)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency QUORUM (1 replica were required but only 0 acknowledged the write)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:60)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:38)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:289)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:269)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
... 22 common frames omitted
У меня есть щедрая стратегия тайм-аута записи и периодической фиксации журнала:
- name: WRITE_REQUEST_TIMEOUT_IN_MS
value: "20000"
- name: COMMITLOG_SYNC
value: "periodic"
- name: COMMITLOG_SYNC_PERIOD_IN_MS
value: "10000"
Кассандра в целом чувствует себя вялой. В журналах Кассандры я вижу много таких вещей:
INFO 18:38:49 G1 Young Generation GC in 405ms. G1 Eden Space: 120586240 -> 0; G1 Old Gen: 393197456 -> 394246032; G1 Survivor Space: 5242880 -> 11534336;
INFO 18:38:51 G1 Young Generation GC in 301ms. G1 Eden Space: 41943040 -> 0; G1 Old Gen: 394246048 -> 405698464; G1 Survivor Space: 11534336 -> 4194304;
INFO 18:39:06 G1 Young Generation GC in 292ms. G1 Eden Space: 46137344 -> 0; G1 Old Gen: 307863800 -> 282697960; G1 Survivor Space: 7340032 -> 6291456;
INFO 18:39:08 G1 Young Generation GC in 304ms. G1 Eden Space: 47185920 -> 0; G1 Old Gen: 282697960 -> 259629304; G1 Survivor Space: 6291456 -> 7340032;
INFO 18:39:10 G1 Young Generation GC in 204ms. G1 Eden Space: 46137344 -> 0; G1 Old Gen: 260677912 -> 259105040; G1 Survivor Space: 7340032 -> 6291456;
INFO 18:39:35 G1 Young Generation GC in 203ms. G1 Eden Space: 156237824 -> 0; G1 Old Gen: 299370920 -> 300419496; G1 Survivor Space: 9437184 -> 14680064;
INFO 18:39:43 G1 Young Generation GC in 301ms. G1 Eden Space: 117440512 -> 0; G1 Old Gen: 306153944 -> 317825504; G1 Survivor Space: 15728640 -> 9437184;
INFO 18:39:57 G1 Young Generation GC in 290ms. G1 Eden Space: 158334976 -> 0; G1 Old Gen: 325588488 -> 330408464; G1 Survivor Space: 13631488 -> 14680064;
INFO 18:40:00 G1 Young Generation GC in 208ms. G1 Eden Space: 101711872 -> 0; G1 Old Gen: 331457072 -> 341942832; G1 Survivor Space: 14680064 -> 9437184;
INFO 18:40:10 G1 Young Generation GC in 211ms. G1 Eden Space: 328204288 -> 0; G1 Old Gen: 345088624 -> 346137200; G1 Survivor Space: 9437184 -> 19922944;
WARN 18:40:14 Out of 31 commit log syncs over the past 296.80s with average duration of 336.74ms, 31 have exceeded the configured commit interval by an average of 111.32ms
INFO 18:40:14 G1 Young Generation GC in 208ms. G1 Eden Space: 135266304 -> 0; G1 Old Gen: 346137200 -> 368059496; G1 Survivor Space: 19922944 -> 7340032;
INFO 18:40:31 G1 Young Generation GC in 493ms. G1 Eden Space: 208666624 -> 0; G1 Old Gen: 367535232 -> 374973064; G1 Survivor Space: 16777216 -> 12582912;
INFO 18:40:57 G1 Young Generation GC in 300ms. G1 Eden Space: 68157440 -> 0; G1 Old Gen: 389128912 -> 402026704; G1 Survivor Space: 15728640 -> 5242880;
INFO 18:41:08 G1 Young Generation GC in 203ms. G1 Eden Space: 39845888 -> 0; G1 Old Gen: 352218840 -> 308988128; G1 Survivor Space: 13631488 -> 4194304;
INFO 18:41:10 G1 Young Generation GC in 204ms. G1 Eden Space: 49283072 -> 0; G1 Old Gen: 308988128 -> 277979360; G1 Survivor Space: 4194304 -> 5242880;
INFO 18:41:12 G1 Young Generation GC in 304ms. G1 Eden Space: 48234496 -> 0; G1 Old Gen: 279027968 -> 249667840; G1 Survivor Space: 5242880 -> 7340032;
INFO 18:41:14 G1 Young Generation GC in 298ms. G1 Eden Space: 46137344 -> 0; G1 Old Gen: 249667840 -> 226074872; G1 Survivor Space: 7340032 -> 5242880;
INFO 18:41:16 G1 Young Generation GC in 202ms. G1 Eden Space: 48234496 -> 0; G1 Old Gen: 226074872 -> 224502016;
INFO 18:41:43 G1 Young Generation GC in 205ms. G1 Eden Space: 225443840 -> 0; G1 Old Gen: 258939328 -> 260512184; G1 Survivor Space: 10485760 -> 15728640;
INFO 18:42:07 G1 Young Generation GC in 202ms. G1 Eden Space: 222298112 -> 0; G1 Old Gen: 289514048 -> 291086904; G1 Survivor Space: 13631488 -> 17825792;
INFO 18:42:17 G1 Young Generation GC in 296ms. G1 Eden Space: 177209344 -> 0; G1 Old Gen: 292135512 -> 303354464; G1 Survivor Space: 23068672 -> 16777216;
INFO 18:42:20 G1 Young Generation GC in 205ms. G1 Eden Space: 106954752 -> 0; G1 Old Gen: 304403072 -> 316252808; G1 Survivor Space: 16777216 -> 9437184;
INFO 18:42:31 G1 Young Generation GC in 393ms. G1 Eden Space: 196083712 -> 0; G1 Old Gen: 316777608 -> 317301904; G1 Survivor Space: 13631488 -> 17825792;
INFO 18:42:33 G1 Young Generation GC in 206ms. G1 Eden Space: 35651584 -> 0; G1 Old Gen: 317301904 -> 333834384; G1 Survivor Space: 17825792 -> 3145728;
И даже это:
WARN 18:40:53 Out of 0 commit log syncs over the past 0.00s with average duration of Infinityms, 1 have exceeded the configured commit interval by an average of 171.42ms
Хуже всего то, что возможная последовательность нарушена. Теперь в реестре есть объекты, которые не были переведены на сторону чтения даже процессорами. Это должно быть там, где произошли тайм-ауты.
Почему сборка мусора происходит так часто, так долго? Как вы думаете, это основное влияние на производительность? Что еще я могу пропустить?