Running TensorFlow gets stuck after "Creating TensorFlow device"
I ran some TensorFlow code on a server with the following hardware:
- 8 Nvidia GTX 1080
- about 40 GB of GPU memory
- 200 GB of RAM
but the run always stops at "Creating TensorFlow device", prints nothing further, and the terminal hangs. Another TF project runs fine on the same machine, but this one fails every time.
2017-11-20 23:32:51.701175: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701252: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701280: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701293: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701320: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:52.552691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 1.32GiB
2017-11-20 23:32:53.059142: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x842f610 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.060813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:05:00.0
Total memory: 7.92GiB
Free memory: 7.70GiB
2017-11-20 23:32:53.582481: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8433370 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.584843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 2 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:08:00.0
Total memory: 7.92GiB
Free memory: 1.16GiB
2017-11-20 23:32:54.094126: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84370d0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.095696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 3 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:09:00.0
Total memory: 7.92GiB
Free memory: 1.82GiB
2017-11-20 23:32:54.633158: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843ae30 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.634412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 4 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:84:00.0
Total memory: 7.92GiB
Free memory: 1.80GiB
2017-11-20 23:32:55.226210: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843eb90 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.227841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 5 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:85:00.0
Total memory: 7.92GiB
Free memory: 4.79GiB
2017-11-20 23:32:55.789872: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84428f0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.790904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 6 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:88:00.0
Total memory: 7.92GiB
Free memory: 773.00MiB
2017-11-20 23:32:56.371886: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8446650 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:56.373006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 7 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:89:00.0
Total memory: 7.92GiB
Free memory: 1001.00MiB
2017-11-20 23:32:56.374795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 4
2017-11-20 23:32:56.374826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 5
2017-11-20 23:32:56.374838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 6
2017-11-20 23:32:56.374850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 7
2017-11-20 23:32:56.375122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 4
2017-11-20 23:32:56.375151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 5
2017-11-20 23:32:56.375178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 6
2017-11-20 23:32:56.375189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 7
2017-11-20 23:32:56.375359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 4
2017-11-20 23:32:56.375374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 5
2017-11-20 23:32:56.375385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 6
2017-11-20 23:32:56.375397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 7
2017-11-20 23:32:56.375465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 4
2017-11-20 23:32:56.375477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 5
2017-11-20 23:32:56.375490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 6
2017-11-20 23:32:56.375501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 7
2017-11-20 23:32:56.375513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 0
2017-11-20 23:32:56.375524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 1
2017-11-20 23:32:56.375536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 2
2017-11-20 23:32:56.375548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 3
2017-11-20 23:32:56.377945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 0
2017-11-20 23:32:56.378028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 1
2017-11-20 23:32:56.378052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 2
2017-11-20 23:32:56.378074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 3
2017-11-20 23:32:56.378504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 0
2017-11-20 23:32:56.378528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 1
2017-11-20 23:32:56.378556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 2
2017-11-20 23:32:56.378591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 3
2017-11-20 23:32:56.378883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 0
2017-11-20 23:32:56.378912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 1
2017-11-20 23:32:56.378936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 2
2017-11-20 23:32:56.378959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 3
2017-11-20 23:32:56.379568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 4 5 6 7
2017-11-20 23:32:56.379591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y Y Y Y N N N N
2017-11-20 23:32:56.379607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: Y Y Y Y N N N N
2017-11-20 23:32:56.379621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2: Y Y Y Y N N N N
2017-11-20 23:32:56.379635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3: Y Y Y Y N N N N
2017-11-20 23:32:56.379651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 4: N N N N Y Y Y Y
2017-11-20 23:32:56.379664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 5: N N N N Y Y Y Y
2017-11-20 23:32:56.379680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 6: N N N N Y Y Y Y
2017-11-20 23:32:56.379694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 7: N N N N Y Y Y Y
2017-11-20 23:32:56.379724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-11-20 23:32:56.379742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:05:00.0)
2017-11-20 23:32:56.379757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
2017-11-20 23:32:56.379772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:09:00.0)
2017-11-20 23:32:56.379786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: GeForce GTX 1080, pci bus id: 0000:84:00.0)
2017-11-20 23:32:56.379800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: GeForce GTX 1080, pci bus id: 0000:85:00.0)
2017-11-20 23:32:56.379816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: GeForce GTX 1080, pci bus id: 0000:88:00.0)
2017-11-20 23:32:56.379829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: GeForce GTX 1080, pci bus id: 0000:89:00.0)
Meanwhile, nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
| 56%   76C    P2   161W / 180W |   6652MiB /  8114MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 0000:05:00.0     Off |                  N/A |
| 24%   38C    P8    11W / 180W |    115MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 0000:08:00.0     Off |                  N/A |
| 24%   42C    P8    12W / 180W |   6820MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 0000:09:00.0     Off |                  N/A |
| 24%   44C    P8    12W / 180W |   6142MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1080    On   | 0000:84:00.0     Off |                  N/A |
| 67%   82C    P2   179W / 180W |   6160MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1080    On   | 0000:85:00.0     Off |                  N/A |
| 48%   71C    P2    81W / 180W |   3094MiB /  8114MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 1080    On   | 0000:88:00.0     Off |                  N/A |
| 25%   58C    P2    51W / 180W |   7230MiB /  8114MiB |     66%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 1080    On   | 0000:89:00.0     Off |                  N/A |
| 26%   59C    P2    52W / 180W |   7002MiB /  8114MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     17811    C   python                                         327MiB |
|    0     21873    C   python                                         254MiB |
|    0     25407    C   ../../caffe/build/tools/caffe                 6065MiB |
|    1     17811    C   python                                         113MiB |
|    2     17811    C   python                                         113MiB |
|    2     22605    C   python                                        6705MiB |
|    3     17811    C   python                                         113MiB |
|    3     22605    C   python                                        6027MiB |
|    4      8984    C   ./build/tools/caffe                           6045MiB |
|    4     17811    C   python                                         113MiB |
|    5     17811    C   python                                         113MiB |
|    5     21873    C   python                                        2977MiB |
|    6     13442    C   python                                        7115MiB |
|    6     17811    C   python                                         113MiB |
|    7     13442    C   python                                        6887MiB |
|    7     17811    C   python                                         113MiB |
+-----------------------------------------------------------------------------+
Since no CUDA out-of-memory error was raised, and there is enough memory and it is actually being used (I could see a little CPU activity), I don't think this is a resource problem. But I have been stuck on this for more than 20 hours. Can anyone help me with this? Thanks a lot.
1 Answer
I had a similar problem when using the TensorFlow Dataset API with tfrecords on 2.x. I had about 24k tfrecords, and my previous data pipeline was:
    import tensorflow as tf

    def load_training_tfrecords(record_mask_file, SHUFFLE):
        # Read the files matching the pattern in parallel via interleave
        dataset = tf.data.Dataset.list_files(record_mask_file).interleave(
            lambda x: tf.data.TFRecordDataset(x),
            cycle_length=NUMBER_OF_PARALLEL_CALL, num_parallel_calls=NUMBER_OF_PARALLEL_CALL)
        dataset = dataset.map(decode_memo_train).map(augmentation_Aggresive).repeat().batch(BATCH_SIZE)
        batched_dataset = dataset.prefetch(PARSHING)
        return batched_dataset
I removed the interleave, which solved my problem. You can also drop the prefetch.
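
The answer does not show the revised pipeline. A minimal sketch of what it might look like without interleave and prefetch follows, assuming the files can simply be read one after another with flat_map. The parameters parse_fn, batch_size and shuffle are hypothetical stand-ins for the decode_memo_train / augmentation_Aggresive functions, the BATCH_SIZE constant and the SHUFFLE argument above, which are not defined in the post.

```python
import tensorflow as tf


def load_training_tfrecords(record_mask_file, parse_fn, batch_size, shuffle=False):
    """Read TFRecord files sequentially, with no interleave and no prefetch.

    parse_fn, batch_size and shuffle are illustrative parameters, not part
    of the original answer.
    """
    # Expand the glob pattern into a dataset of file names.
    files = tf.data.Dataset.list_files(record_mask_file, shuffle=shuffle)
    # flat_map reads each file to the end before opening the next one,
    # so no parallel readers are created.
    dataset = files.flat_map(tf.data.TFRecordDataset)
    # Decode each record, loop over the data indefinitely, then batch.
    dataset = dataset.map(parse_fn).repeat().batch(batch_size)
    return dataset
```

With the names from the answer, this would be called as load_training_tfrecords(record_mask_file, decode_memo_train, BATCH_SIZE), with the augmentation applied inside or after parse_fn. Sequential reading gives up parallel I/O, so it is a way to check whether the hang goes away rather than a tuned pipeline.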