TensorFlow run hangs after "Creating TensorFlow device"

I ran some TensorFlow code on a server with the following hardware:

  • 8 Nvidia GTX 1080 GPUs
  • about 40 GB of graphics memory
  • 200 GB of memory

but the run always stops at "Creating TensorFlow device", prints nothing further, and the terminal becomes unresponsive. Another TF project runs fine on the same machine, but this one always fails.

2017-11-20 23:32:51.701175: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701252: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701280: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701293: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:51.701320: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 23:32:52.552691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:04:00.0
Total memory: 7.92GiB
Free memory: 1.32GiB
2017-11-20 23:32:53.059142: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x842f610 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.060813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:05:00.0
Total memory: 7.92GiB
Free memory: 7.70GiB
2017-11-20 23:32:53.582481: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8433370 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:53.584843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 2 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:08:00.0
Total memory: 7.92GiB
Free memory: 1.16GiB
2017-11-20 23:32:54.094126: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84370d0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.095696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 3 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:09:00.0
Total memory: 7.92GiB
Free memory: 1.82GiB
2017-11-20 23:32:54.633158: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843ae30 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:54.634412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 4 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:84:00.0
Total memory: 7.92GiB
Free memory: 1.80GiB
2017-11-20 23:32:55.226210: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x843eb90 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.227841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 5 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:85:00.0
Total memory: 7.92GiB
Free memory: 4.79GiB
2017-11-20 23:32:55.789872: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x84428f0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:55.790904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 6 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:88:00.0
Total memory: 7.92GiB
Free memory: 773.00MiB
2017-11-20 23:32:56.371886: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8446650 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 23:32:56.373006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 7 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:89:00.0
Total memory: 7.92GiB
Free memory: 1001.00MiB
2017-11-20 23:32:56.374795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 4
2017-11-20 23:32:56.374826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 5
2017-11-20 23:32:56.374838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 6
2017-11-20 23:32:56.374850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 7
2017-11-20 23:32:56.375122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 4
2017-11-20 23:32:56.375151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 5
2017-11-20 23:32:56.375178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 6
2017-11-20 23:32:56.375189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 7
2017-11-20 23:32:56.375359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 4
2017-11-20 23:32:56.375374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 5
2017-11-20 23:32:56.375385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 6
2017-11-20 23:32:56.375397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 2 and 7
2017-11-20 23:32:56.375465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 4
2017-11-20 23:32:56.375477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 5
2017-11-20 23:32:56.375490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 6
2017-11-20 23:32:56.375501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 3 and 7
2017-11-20 23:32:56.375513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 0
2017-11-20 23:32:56.375524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 1
2017-11-20 23:32:56.375536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 2
2017-11-20 23:32:56.375548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 4 and 3
2017-11-20 23:32:56.377945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 0
2017-11-20 23:32:56.378028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 1
2017-11-20 23:32:56.378052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 2
2017-11-20 23:32:56.378074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 5 and 3
2017-11-20 23:32:56.378504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 0
2017-11-20 23:32:56.378528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 1
2017-11-20 23:32:56.378556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 2
2017-11-20 23:32:56.378591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 6 and 3
2017-11-20 23:32:56.378883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 0
2017-11-20 23:32:56.378912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 1
2017-11-20 23:32:56.378936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 2
2017-11-20 23:32:56.378959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 7 and 3
2017-11-20 23:32:56.379568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 4 5 6 7
2017-11-20 23:32:56.379591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y Y Y N N N N
2017-11-20 23:32:56.379607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y Y Y N N N N
2017-11-20 23:32:56.379621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2:   Y Y Y Y N N N N
2017-11-20 23:32:56.379635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3:   Y Y Y Y N N N N
2017-11-20 23:32:56.379651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 4:   N N N N Y Y Y Y
2017-11-20 23:32:56.379664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 5:   N N N N Y Y Y Y
2017-11-20 23:32:56.379680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 6:   N N N N Y Y Y Y
2017-11-20 23:32:56.379694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 7:   N N N N Y Y Y Y
2017-11-20 23:32:56.379724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:04:00.0)
2017-11-20 23:32:56.379742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:05:00.0)
2017-11-20 23:32:56.379757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
2017-11-20 23:32:56.379772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:09:00.0)
2017-11-20 23:32:56.379786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: GeForce GTX 1080, pci bus id: 0000:84:00.0)
2017-11-20 23:32:56.379800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: GeForce GTX 1080, pci bus id: 0000:85:00.0)
2017-11-20 23:32:56.379816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: GeForce GTX 1080, pci bus id: 0000:88:00.0)
2017-11-20 23:32:56.379829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: GeForce GTX 1080, pci bus id: 0000:89:00.0)

Meanwhile, nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
| 56%   76C    P2   161W / 180W |   6652MiB /  8114MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 0000:05:00.0     Off |                  N/A |
| 24%   38C    P8    11W / 180W |    115MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 0000:08:00.0     Off |                  N/A |
| 24%   42C    P8    12W / 180W |   6820MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 0000:09:00.0     Off |                  N/A |
| 24%   44C    P8    12W / 180W |   6142MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1080    On   | 0000:84:00.0     Off |                  N/A |
| 67%   82C    P2   179W / 180W |   6160MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1080    On   | 0000:85:00.0     Off |                  N/A |
| 48%   71C    P2    81W / 180W |   3094MiB /  8114MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 1080    On   | 0000:88:00.0     Off |                  N/A |
| 25%   58C    P2    51W / 180W |   7230MiB /  8114MiB |     66%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 1080    On   | 0000:89:00.0     Off |                  N/A |
| 26%   59C    P2    52W / 180W |   7002MiB /  8114MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     17811    C   python                                         327MiB |
|    0     21873    C   python                                         254MiB |
|    0     25407    C   ../../caffe/build/tools/caffe                 6065MiB |
|    1     17811    C   python                                         113MiB |
|    2     17811    C   python                                         113MiB |
|    2     22605    C   python                                        6705MiB |
|    3     17811    C   python                                         113MiB |
|    3     22605    C   python                                        6027MiB |
|    4      8984    C   ./build/tools/caffe                           6045MiB |
|    4     17811    C   python                                         113MiB |
|    5     17811    C   python                                         113MiB |
|    5     21873    C   python                                        2977MiB |
|    6     13442    C   python                                        7115MiB |
|    6     17811    C   python                                         113MiB |
|    7     13442    C   python                                        6887MiB |
|    7     17811    C   python                                         113MiB |
+-----------------------------------------------------------------------------+

Since this did not raise a CUDA out-of-memory error, and there is enough memory and it was being used (I saw a little CPU usage), I don't think this is a resource problem... But I have been stuck for more than 20 hours... Can anyone help me with this? Many thanks.
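One way to check the "memory is sufficient" assumption is to read the per-device "Free memory" values from the startup log above. The following is only a minimal sketch and not part of the original post: the helper names `free_memory_by_device` and `devices_with_at_least` and the 4 GiB threshold are hypothetical, and it assumes the log format shown in the question.

```python
import re

def free_memory_by_device(log_text):
    """Return {device_index: free_GiB} parsed from a TensorFlow startup log."""
    free = {}
    current = None
    for line in log_text.splitlines():
        found = re.search(r"Found device (\d+) with properties", line)
        if found:
            # Remember which device the following property lines belong to.
            current = int(found.group(1))
            continue
        mem = re.match(r"\s*Free memory:\s*([\d.]+)(GiB|MiB)", line)
        if mem and current is not None:
            value = float(mem.group(1))
            # Normalize MiB values to GiB so devices can be compared.
            free[current] = value if mem.group(2) == "GiB" else value / 1024.0
            current = None
    return free

def devices_with_at_least(free, min_gib):
    """List device indices whose free memory is at least min_gib (hypothetical threshold)."""
    return sorted(d for d, gib in free.items() if gib >= min_gib)

# Shortened excerpt of the log from the question (devices 0, 1 and 6 only).
sample_log = """\
I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080
Free memory: 1.32GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: GeForce GTX 1080
Free memory: 7.70GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 6 with properties:
name: GeForce GTX 1080
Free memory: 773.00MiB
"""

free = free_memory_by_device(sample_log)
print(devices_with_at_least(free, 4.0))
```

Applied to the full log in the question, a 4 GiB threshold would leave only devices 1 and 5 (7.70 GiB and 4.79 GiB free). Restricting the process to such a device, for example through the `CUDA_VISIBLE_DEVICES` environment variable, is one possible way to test whether the hang is related to the nearly full GPUs.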

1 Answer

I had a similar problem when using the TensorFlow Dataset API with tfrecords in 2.x. I had about 24k tfrecords, and my previous data pipeline was:

    def load_training_tfrecords(record_mask_file, SHUFFLE):
        dataset = tf.data.Dataset.list_files(record_mask_file).interleave(
            lambda x: tf.data.TFRecordDataset(x),
            cycle_length=NUMBER_OF_PARALLEL_CALL,
            num_parallel_calls=NUMBER_OF_PARALLEL_CALL)
        dataset = dataset.map(decode_memo_train).map(augmentation_Aggresive).repeat().batch(BATCH_SIZE)
        batched_dataset = dataset.prefetch(PARSHING)
        return batched_dataset

I removed the interleave, which solved my problem. You can also leave out the prefetch.
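The change described above might look roughly like the sketch below. It is not the answerer's actual code: it assumes TensorFlow 2.x, reads the matched files one after another with `flat_map` instead of `interleave`, and leaves out `prefetch`. The `parse_fn` and `batch_size` parameters are hypothetical stand-ins for `decode_memo_train`/`augmentation_Aggresive` and `BATCH_SIZE`.

```python
import tensorflow as tf

def load_training_tfrecords(record_mask_file, parse_fn, batch_size):
    # Expand the glob pattern into a dataset of file names (sorted, not shuffled).
    files = tf.data.Dataset.list_files(record_mask_file, shuffle=False)
    # Read the files sequentially; no interleave and no parallel readers.
    dataset = files.flat_map(lambda name: tf.data.TFRecordDataset(name))
    # Decode each record, repeat indefinitely, and batch; no prefetch.
    dataset = dataset.map(parse_fn).repeat().batch(batch_size)
    return dataset
```

Sequential reading gives up the parallel file access that `interleave` provides, so this is mainly useful for confirming whether the parallel readers are what stalls the job.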
