MongoDB primary cannot rejoin the replica set after going down

I have a 3-member MongoDB replica set running on Windows. When the primary server (S1) goes down, a secondary is elected correctly. When the primary server comes back, that replica set member stays in an invalid state:

     {
            "state" : 10,
            "stateStr" : "REMOVED",
            "uptime" : 111,
            "optime" : Timestamp(1448462710, 6),
            "optimeDate" : ISODate("2015-11-25T14:45:10Z"),
            "ok" : 0,
            "errmsg" : "Our replica set config is invalid or we are not a member of it",
            "code" : 93
     }

After that, the secondary keeps switching between primary and secondary every few seconds, which makes my application unstable.
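
To put a number on "every few seconds", the state-change lines from a mongod log can be timed. The following is only a minimal sketch: the sample lines are copied from the log posted further down in this question, while the function names `state_changes` and `flip_intervals` are made up for illustration.

```python
import re
from datetime import datetime

# Sample lines copied from the mongod log posted further down in this question.
LOG = """\
2016-09-06T07:42:07.627Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:18.206Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:42:28.356Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:38.936Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
"""

# Captures the UTC timestamp (without the trailing Z), the member and the new state.
STATE_RE = re.compile(r"^(\S+)Z\s.*Member (\S+) is now in state (\w+)$")


def state_changes(log_text):
    """Return (timestamp, member, state) tuples for every state-change line."""
    changes = []
    for line in log_text.splitlines():
        match = STATE_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            changes.append((ts, match.group(2), match.group(3)))
    return changes


def flip_intervals(changes, member):
    """Seconds between consecutive state changes reported for one member."""
    times = [ts for ts, who, _ in changes if who == member]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in flip_intervals(state_changes(LOG), "m2.companyName.com:40000"):
        print(f"{gap:.3f} s between state changes")
```

For these sample lines the changes are a little over 10 seconds apart, which happens to be close to the `electionTimeoutMillis` of 10000 in the posted configuration. Whether the two are actually related is not established here.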

The only way to bring the primary back is to run rs.reconfig(c).

I could not find anything wrong in the configuration files.

Any help would be appreciated.

UPDATE: Here is the current configuration:

{
    "_id" : "companyName",
    "version" : 32593,
    "protocolVersion" : NumberLong(1),
    "members" : [
            {
                    "_id" : 1,
                    "host" : "arb.companyName.com:40000",
                    "arbiterOnly" : true,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 1,
                    "tags" : {

                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            },
            {
                    "_id" : 2,
                    "host" : "m3.companyName.com:40000",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 11,
                    "tags" : {

                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            },
            {
                    "_id" : 4,
                    "host" : "m2.companyName.com:40000",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 3,
                    "tags" : {

                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            }
    ],
    "settings" : {
            "chainingAllowed" : true,
            "heartbeatIntervalMillis" : 2000,
            "heartbeatTimeoutSecs" : 10,
            "electionTimeoutMillis" : 10000,
            "getLastErrorModes" : {

            },
            "getLastErrorDefaults" : {
                    "w" : 1,
                    "wtimeout" : 0
            },
            "replicaSetId" : ObjectId("573dfcd0e8ae6154ff80c50d")
    }
}

Should I use IP addresses instead of host names?

UPDATE 2:

This is the log of the primary server (m3.companyName.com, IP 1.1.1.1), from the moment it was rebooted until I logged on to the other server (m2.companyName.com, IP 2.2.2.2) and ran rs.reconfig() manually.

2016-09-06T07:42:05.953Z I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-09-06T07:42:05.953Z I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'c:/mongossl/data3/diagnostic.data'
2016-09-06T07:42:05.954Z I NETWORK  [initandlisten] waiting for connections on port 40000 ssl
2016-09-06T07:42:05.955Z W NETWORK  [ReplicationExecutor] getaddrinfo("arb.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.955Z I NETWORK  [ReplicationExecutor] getaddrinfo("arb.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.957Z W NETWORK  [ReplicationExecutor] getaddrinfo("m3.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.957Z I NETWORK  [ReplicationExecutor] getaddrinfo("m3.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.958Z W NETWORK  [ReplicationExecutor] getaddrinfo("m2.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.959Z I NETWORK  [ReplicationExecutor] getaddrinfo("m2.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.959Z W REPL     [ReplicationExecutor] Locally stored replica set configuration does not have a valid entry for the current node; waiting for reconfig or remote heartbeat; Got "NodeNotFound: No host described in new configuration 32592 for replica set companyName2 maps to this node" while validating { _id: "companyName2", version: 32592, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] New replica set config in use: { _id: "companyName2", version: 32592, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] This node is not a member of the config
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] transition to REMOVED
2016-09-06T07:42:05.959Z I REPL     [ReplicationExecutor] Starting replication applier threads
2016-09-06T07:42:06.651Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53746 #1 (1 connection now open)
2016-09-06T07:42:06.760Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53747 #2 (2 connections now open)
2016-09-06T07:42:06.864Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53748 #3 (3 connections now open)
2016-09-06T07:42:06.993Z I ACCESS   [conn1]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.067Z I ACCESS   [conn2]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.159Z I ACCESS   [conn3]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.552Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:07.627Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:08.975Z I NETWORK  [conn1] end connection 2.2.2.2:53746 (2 connections now open)
2016-09-06T07:42:08.975Z I NETWORK  [conn2] end connection 2.2.2.2:53747 (2 connections now open)
2016-09-06T07:42:08.975Z I NETWORK  [conn3] end connection 2.2.2.2:53748 (2 connections now open)
2016-09-06T07:42:09.371Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53763 #4 (1 connection now open)
2016-09-06T07:42:09.639Z I ACCESS   [conn4]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:13.059Z I NETWORK  [initandlisten] connection accepted from 3.3.3.3:58220 #5 (2 connections now open)
2016-09-06T07:42:13.127Z I ACCESS   [conn5]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=arb.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:13.292Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to arb.companyName.com:40000
2016-09-06T07:42:13.301Z I REPL     [ReplicationExecutor] Member arb.companyName.com:40000 is now in state ARBITER
2016-09-06T07:42:13.974Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53765 #6 (3 connections now open)
2016-09-06T07:42:14.433Z I ACCESS   [conn6] Successfully authenticated as principal appUser on companyName
2016-09-06T07:42:16.629Z I NETWORK  [initandlisten] connection accepted from 1.1.1.13:49162 #7 (4 connections now open)
2016-09-06T07:42:16.853Z I ACCESS   [conn7] Successfully authenticated as principal appUser on companyName
2016-09-06T07:42:17.703Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:17.703Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:18.131Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:18.206Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:42:23.369Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53767 #8 (5 connections now open)
2016-09-06T07:42:23.832Z I ACCESS   [conn8] Successfully authenticated as principal sa on admin
2016-09-06T07:42:28.356Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:38.431Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:38.431Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:38.861Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:38.936Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:42:49.086Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:59.161Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:59.161Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:59.590Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:59.665Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:09.814Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:19.889Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:43:19.889Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:43:20.317Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:43:20.392Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:30.542Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:34.054Z I NETWORK  [initandlisten] connection accepted from 1.1.1.13:49188 #9 (6 connections now open)
2016-09-06T07:43:34.106Z I ACCESS   [conn9] Successfully authenticated as principal sa on admin
2016-09-06T07:43:40.617Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:43:40.617Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:43:41.045Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:43:41.120Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:51.270Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:51.277Z I NETWORK  [initandlisten] connection accepted from 1.1.1.13:49193 #10 (7 connections now open)
2016-09-06T07:43:51.339Z I ACCESS   [conn10] Successfully authenticated as principal sa on admin
2016-09-06T07:44:01.346Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:01.346Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:01.775Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:01.850Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:12.001Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:44:22.077Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:22.077Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:22.506Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:22.582Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:32.732Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:44:42.807Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:42.807Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:43.237Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:43.312Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:53.462Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:03.537Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:03.537Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:03.966Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:04.041Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:14.191Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:24.266Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:24.266Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:24.700Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:24.775Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:34.925Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:45.000Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:45.000Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:45.428Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:45.504Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:55.654Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:05.729Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:46:05.729Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:46:06.157Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:06.232Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:46:16.382Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:26.458Z I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:46:26.458Z I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:46:26.889Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:26.964Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:46:37.115Z I REPL     [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:43.185Z I NETWORK  [initandlisten] connection accepted from 2.2.2.2:53847 #11 (8 connections now open)
2016-09-06T07:46:43.392Z I ACCESS   [conn11]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:46:43.541Z I NETWORK  [conn11] end connection 2.2.2.2:53847 (7 connections now open)
2016-09-06T07:46:44.370Z I NETWORK  [initandlisten] connection accepted from 3.3.3.3:58224 #12 (8 connections now open)
2016-09-06T07:46:44.434Z I ACCESS   [conn12]  authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=arb.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:46:44.451Z I NETWORK  [conn12] end connection 3.3.3.3:58224 (7 connections now open)
2016-09-06T07:46:47.832Z I REPL     [ReplicationExecutor] New replica set config in use: { _id: "companyName2", version: 32593, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:46:47.832Z I REPL     [ReplicationExecutor] This node is m3.companyName.com:40000 in the config
2016-09-06T07:46:47.832Z I REPL     [ReplicationExecutor] transition to STARTUP2
2016-09-06T07:46:47.907Z I REPL     [ReplicationExecutor] Scheduling priority takeover at 2016-09-06T03:46:57.907-0400
2016-09-06T07:46:48.040Z I REPL     [ReplicationExecutor] syncing from: m2.companyName.com:40000
2016-09-06T07:46:48.545Z I REPL     [SyncSourceFeedback] setting syncSourceFeedback to m2.companyName.com:40000
2016-09-06T07:46:48.977Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:50.983Z I REPL     [ReplicationExecutor] transition to RECOVERING
2016-09-06T07:46:50.985Z I REPL     [ReplicationExecutor] transition to SECONDARY
2016-09-06T07:46:51.438Z I REPL     [ReplicationExecutor] could not find member to sync from
2016-09-06T07:46:57.907Z I REPL     [ReplicationExecutor] Canceling priority takeover callback
2016-09-06T07:46:57.907Z I REPL     [ReplicationExecutor] Starting an election for a priority takeover
2016-09-06T07:46:57.907Z I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-09-06T07:46:57.916Z I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2016-09-06T07:46:57.925Z I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 244
2016-09-06T07:46:57.925Z I REPL     [ReplicationExecutor] transition to PRIMARY
2016-09-06T07:46:58.345Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:58.362Z I ASIO     [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:58.440Z I REPL     [rsSync] transition to primary complete; database writes are now permitted

The most obvious thing I noticed is the "No such host is known" error. Could it be that Mongo is trying to start before Windows is able to resolve host names?
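
One way to look at that hypothesis is to list which host names failed to resolve at startup and check whether the node's own name is among them. If it is, the node has no way to match itself to an entry in the config, which would fit the "NodeNotFound" message in the log. This is only a sketch: the log lines are copied from the log above, and the names `unresolved_hosts` and `SELF` are made up for illustration.

```python
import re

# Warning lines copied from the startup log above.
LOG_LINES = [
    '2016-09-06T07:42:05.955Z W NETWORK  [ReplicationExecutor] getaddrinfo("arb.companyName.com") failed: errno:11001 No such host is known.',
    '2016-09-06T07:42:05.957Z W NETWORK  [ReplicationExecutor] getaddrinfo("m3.companyName.com") failed: errno:11001 No such host is known.',
    '2016-09-06T07:42:05.958Z W NETWORK  [ReplicationExecutor] getaddrinfo("m2.companyName.com") failed: errno:11001 No such host is known.',
]

FAILED_RE = re.compile(r'getaddrinfo\("([^"]+)"\) failed')

# The host name of the rebooted node, as stated in the question.
SELF = "m3.companyName.com"


def unresolved_hosts(lines):
    """Return the distinct host names whose lookup failed, in log order."""
    hosts = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match and match.group(1) not in hosts:
            hosts.append(match.group(1))
    return hosts


if __name__ == "__main__":
    failed = unresolved_hosts(LOG_LINES)
    print("failed lookups:", failed)
    print("own host name unresolved:", SELF in failed)
```

In the posted log all three member names fail within a few milliseconds of each other, including the node's own name, so the failure looks like name resolution as a whole not being ready rather than one bad host entry.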

2 Answers

Solution

Delay the startup of mongod. That will solve this problem.

I had the same problem when I tried to replace a secondary from a backup. The cause was that I started the mongod process on the backup server before it was reachable by the replica set (that is, before switching over from the old server to the new server restored from the backup). After restarting the mongod process, the problem was resolved.

I suggest starting the mongod process only after the server is reachable by the replica set it is supposed to belong to.
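
Both answers amount to "do not start mongod until the network is ready". A minimal sketch of that idea is a small launcher that waits until every replica set host name resolves and only then starts the service. Everything named here is an assumption for illustration: the helper names, the retry counts, and in particular the Windows service name `MongoDB`, which may differ on a given installation.

```python
import socket
import subprocess
import time

# Host names taken from the replica set config in the question.
HOSTS = ["arb.companyName.com", "m3.companyName.com", "m2.companyName.com"]


def all_resolvable(hosts, resolve=socket.getaddrinfo):
    """True when every host name resolves; `resolve` is injectable for testing."""
    for host in hosts:
        try:
            resolve(host, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            return False
    return True


def wait_for_dns(hosts, attempts=30, delay_s=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Retry until all hosts resolve; give up after `attempts` tries."""
    for _ in range(attempts):
        if all_resolvable(hosts, resolve):
            return True
        sleep(delay_s)
    return False


def start_when_ready(hosts, start_cmd):
    """Run `start_cmd` only once name resolution works for all hosts."""
    if not wait_for_dns(hosts):
        raise RuntimeError("host names still unresolved, not starting mongod")
    subprocess.run(start_cmd, check=True)


# Example call (assumes a Windows service named "MongoDB"; adjust as needed):
# start_when_ready(HOSTS, ["net", "start", "MongoDB"])
```

With this approach the mongod service itself would be set to manual start and the launcher run as a startup task. If a fixed delay is good enough, setting the service's startup type to "Automatic (Delayed Start)" in Windows is a simpler alternative that needs no script.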
