Службам nova периодически не удается подключиться к rabbitmq с ошибкой тайм-аута
Я использую Openstack Pike с RabbitMQ 3.6.5 на Erlang 18.3.4.4 в качестве очереди сообщений. Когда службы nova перезапускаются, одна из служб nova (в основном, nova-scheduler) будет иметь ошибку в журналах, как показано ниже. В нем говорится, что служба не может подключиться к rabbitmq, и ошибка появляется снова и снова, пока не будет перезапущена. Запуск экземпляра также завершается неудачно из-за этой ошибки. Службы openstack-nova- start успешно завершены.
2018-05-31 06:19:44.737 5510 INFO nova.service [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Starting scheduler node (version 16.1.1-1.el7)
2018-05-31 06:19:44.745 5510 DEBUG nova.service [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/site-packages/nova/service.py:179
2018-05-31 06:19:44.749 5510 DEBUG oslo.messaging._drivers.pool [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Pool creating new connection create /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/pool.py:143
2018-05-31 06:19:44.753 5510 DEBUG oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] Connecting to AMQP server on localhost:5672 __init__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:597
2018-05-31 06:19:44.761 5510 DEBUG nova.scheduler.host_manager [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Found 2 cells: 00000000-0000-0000-0000-000000000000, 4963dd3d-246d-4f9b-9d4b-0cb1ab3032ae _load_cells /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:642
2018-05-31 06:19:44.761 5510 DEBUG nova.scheduler.host_manager [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] START:_async_init_instance_info _async_init_instance_info /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:421
2018-05-31 06:19:44.763 5510 DEBUG oslo_concurrency.lockutils [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Lock "00000000-0000-0000-0000-000000000000" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270
2018-05-31 06:19:44.764 5510 DEBUG oslo_concurrency.lockutils [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Lock "00000000-0000-0000-0000-000000000000" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2018-05-31 06:19:49.761 5510 DEBUG oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] Received recoverable error from kombu: on_error /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:744
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 494, in _ensured
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 569, in __call__
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self.revive(self.connection.default_channel)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 819, in default_channel
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self.connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 802, in connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self._connection = self._establish_connection()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 757, in _establish_connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit conn = self.transport.establish_connection()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit conn.connect()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 300, in connect
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self.drain_events(timeout=self.connect_timeout)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return self.blocking_read(timeout)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 468, in blocking_read
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit frame = self.transport.read_frame()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 237, in read_frame
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit frame_header = read(7, True)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 377, in _read
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit s = recv(n - len(rbuf))
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 354, in recv
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return self._recv_loop(self.fd.recv, b'', bufsize, flags)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 348, in _recv_loop
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self._read_trampoline()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 319, in _read_trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit timeout_exc=socket.timeout("timed out"))
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 203, in _trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit mark_as_closed=self._mark_as_closed)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 162, in trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return hub.switch()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return self.greenlet.switch()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit timeout: timed out
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit
2018-05-31 06:19:49.763 5510 ERROR oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] AMQP server on 127.0.0.1:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
Журналы rabbitmq показывают ошибки принятия и закрытия TCP.
=INFO REPORT==== 31-May-2018::06:39:41 === accepting AMQP connection <0.957.0> (127.0.0.1:33456 -> 127.0.0.1:5672)
=ERROR REPORT==== 31-May-2018::06:40:11 === closing AMQP connection <0.948.0> (127.0.0.1:33450 -> 127.0.0.1:5672): {handshake_timeout,frame_header}
В ходе расследования я обнаружил, что ошибка rabbitmq указывает на то, что соответствующий заголовок кадра не был отправлен клиентом и, следовательно, соединение было закрыто. Но я не могу найти способ узнать, в чем проблема с заголовками фреймов или из-за некоторых других ошибок. Может кто-нибудь, пожалуйста, пролить свет на то, как я могу дальше отлаживать это, пожалуйста? Благодарю.
1 ответ
Я предполагаю, что это проблема с TCP/буфером и т. д. Попробуйте увеличить буфер памяти rabbitmq и количество TCP-соединений.
$ sysctl -w net.ipv4.tcp_sack=0
$ эхо 0 > /proc/sys/net/ipv4/tcp_sack