Distributed TensorFlow error on CPU+GPU when running the example
I installed the NVIDIA CUDA Toolkit 10.0, tensorflow 1.15, tensorflow-gpu 1.15, and cuDNN for CUDA 10.0.
I followed the example here:
# On ps0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=1
The command under "# On ps0.example.com:" worked. However, the other commands ("# On ps1.example.com:", "# On worker0.example.com:" and "# On worker1.example.com:") did not.
Errors:
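For context, here is a minimal sketch of how these flags are usually consumed. It assumes trainer.py follows the common pattern of splitting --ps_hosts and --worker_hosts on commas and using the entry at --task_index as the address that task listens on. The function name bind_address is hypothetical and no TensorFlow call is made; the sketch only shows which host:port each of the four commands resolves to.

```python
def bind_address(ps_hosts, worker_hosts, job_name, task_index):
    """Return the host:port entry that the given task is expected to listen on.

    Assumption: the script builds its cluster description by splitting the
    two comma-separated flags, as in the commands shown in the question.
    """
    cluster = {
        "ps": ps_hosts.split(","),
        "worker": worker_hosts.split(","),
    }
    return cluster[job_name][task_index]


# Flag values copied from the commands in the question.
PS = "ps0.example.com:2222,ps1.example.com:2222"
WORKERS = "worker0.example.com:2222,worker1.example.com:2222"

for job, index in [("ps", 0), ("ps", 1), ("worker", 0), ("worker", 1)]:
    address = bind_address(PS, WORKERS, job, index)
    port = address.rsplit(":", 1)[1]
    print(job, index, address, "port", port)
```

Every task resolves to port 2222 on its own hostname. That is fine when each command runs on a separate machine, but if more than one of them is started on the same machine they would all ask for the same port.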
2020-07-01 22:59:22.564024: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
WARNING:tensorflow:From trainer.py:85: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
WARNING:tensorflow:From trainer.py:17: The name tf.train.Server is deprecated. Please use tf.distribute.Server instead.
W0701 22:59:25.564601 14352 module_wrapper.py:139] From trainer.py:17: The name tf.train.Server is deprecated. Please use tf.distribute.Server instead.
2020-07-01 22:59:25.566460: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-01 22:59:25.576286: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-01 22:59:25.612394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro RTX 4000 major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:21:00.0
2020-07-01 22:59:25.620730: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-07-01 22:59:25.631026: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-07-01 22:59:25.639325: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-07-01 22:59:25.645747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-07-01 22:59:25.656569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-07-01 22:59:25.661838: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-07-01 22:59:25.672331: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-01 22:59:25.676277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 22:59:26.272129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 22:59:26.277910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-07-01 22:59:26.283441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-07-01 22:59:26.287436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 6295 MB memory) -> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:21:00.0, compute capability: 7.5)
E0701 22:59:26.302000000 14352 external/grpc/src/core/ext/transport/chttp2/server/insecure/server_chttp2.cc:40] {"created":"@1593637166.302000000","description":"No address added out of total 1 resolved","file":"external/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":348,"referenced_errors":[{"created":"@1593637166.302000000","description":"Failed to add port to server","file":"external/grpc/src/core/lib/iomgr/tcp_server_windows.cc","file_line":509,"referenced_errors":[{"created":"@1593637166.302000000","description":"OS Error","file":"external/grpc/src/core/lib/iomgr/tcp_server_windows.cc","file_line":201,"os_error":"Only one usage of each socket address (protocol/network address/port) is normally permitted.\r\n","syscall":"bind","wsa_error":10048}]}]}
2020-07-01 22:59:26.336729: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:449] Unknown: Could not start gRPC server
Traceback (most recent call last):
  File "trainer.py", line 85, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\ProgramData\Anaconda3\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "trainer.py", line 19, in main
    task_index=FLAGS.task_index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\training\server_lib.py", line 147, in __init__
    self._server = c_api.TF_NewServer(self._server_def.SerializeToString())
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server
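The gRPC error in the log reports wsa_error 10048 on bind, which is the Winsock code for an address that is already in use. One way to check for that before starting a task is to try binding the port directly. This is a sketch using only the Python standard library; port_is_free is a hypothetical helper, not part of TensorFlow, and it assumes the tasks are being started on the local machine.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # On Windows this is where error 10048 would surface.
            return False
        return True


if __name__ == "__main__":
    # 2222 is the port used by every task in the commands from the question.
    print("port 2222 free:", port_is_free(2222))
```

If the check returns False for 2222 on the machine where a task is about to start, some other process (possibly an earlier task from the same cluster) is already holding that port.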