Enabling XLA JIT with multi-GPU in tf.slim

I enabled XLA in tf.slim with multi-GPU (2 Titan Xp) as shown below, by editing train_image_classifier.py.

   jit_config = tf.ConfigProto()
   jit_level = tf.OptimizerOptions.ON_1
   jit_config.graph_options.optimizer_options.global_jit_level = jit_level

   ###########################
   # Kicks off the training. #
   ###########################

   slim.learning.train(
       .... (same as original)
       sync_optimizer=optimizer if FLAGS.sync_replicas else None,
       session_config=jit_config)

Then I ran the following command.

>> python train_image_classifier.py 
--train_dir=/tmp/imagenet_train --dataset_name=imagenet 
--dataset_split_name=train --dataset_dir=$DATA_DIR
--model_name=inception_v3 --max_number_of_steps=5000 
--batch_size=64 --num_clones=2

However, I got these error messages.

2017-07-31 18:34:05.231408: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_1_bfc) ran out of memory trying to allocate 146.34MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-07-31 18:34:05.246182: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_1_bfc) ran out of memory trying to allocate 1.48GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
......
2017-07-31 18:34:06.311713: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311761: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0xd25d400: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311790: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0xd25d400: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311855: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311869: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311942: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311959: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311973: E tensorflow/compiler/xla/service/gpu/convolution_thunk.cc:325] No convolution algorithm works with profiling. Fall back to the default algorithm.
2017-07-31 18:34:06.311983: E tensorflow/compiler/xla/service/gpu/convolution_thunk.cc:334] No convolution algorithm without scratch works with profiling. Fall back to the default algorithm.
2017-07-31 18:34:06.312037: F tensorflow/stream_executor/cuda/cuda_dnn.cc:2877] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
Aborted (core dumped)

Previously, XLA JIT worked fine with these flags (--batch_size=32 --num_clones=1). I suspect there is a bug in XLA's buffer allocation. Can anyone tell me whether I did something wrong?
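
For reference, the failing run differs from the working one in two flags at once, not only in the number of GPUs. Below is a minimal sketch of that arithmetic. It assumes that --batch_size is applied per clone (per GPU), which is my understanding of the slim script but is not confirmed here; the helper name per_step_load is hypothetical and exists only for this illustration.

```python
# Assumption (not verified against the slim source here): --batch_size is the
# per-clone (per-GPU) batch size, so images per step = batch_size * num_clones.

def per_step_load(batch_size, num_clones):
    """Return (images per GPU, images per step) under the assumption above."""
    return batch_size, batch_size * num_clones

# Flags from the run that worked and from the run that crashed.
working = per_step_load(batch_size=32, num_clones=1)
failing = per_step_load(batch_size=64, num_clones=2)

print("working run (per GPU, per step):", working)
print("failing run (per GPU, per step):", failing)
print("per-GPU batch ratio:", failing[0] / working[0])
```

If that assumption holds, each GPU in the failing run processes twice the batch of the working run, so a run with --batch_size=32 --num_clones=2 might help separate the effect of batch size from the effect of multi-GPU.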
