R package or methodology for checkpointing progress in parallel computations

I am trying to schedule the following R script (exposure_train.R):

.libPaths("/gscratch/csde/mienkoja/rpackages")
library(Rmpi)
library(missForest)
library(snow)
library(doSNOW)

# load the data prepared above
load("/home/mienkoja/pse_rodis/dat_clean.rds")

# get rid of garbage from memory
gc()

# read node information from the system environment
nodefile <- Sys.getenv("PBS_NODEFILE")

# assign node information to a nodes object
nodes <- readLines(nodefile)

# create a cluster
cl <- makeMPIcluster(length(nodes), includemaster=TRUE)

# register the cluster
registerDoSNOW(cl)

rf_random <- missForest(dat_clean, maxiter = 10, ntree = 100, parallelize = "forests")

stopCluster(cl)

save(rf_random, file = "/home/mienkoja/pse_rodis/rf_random_out.rds")

I am working on a cluster with several hundred nodes, each of which typically has 16 processor cores and 128 GB of memory. All nodes run CentOS 6 Linux and are tied together by the Moab cluster software. I schedule the job with the TORQUE scheduler, using the following Bash file:

#!/bin/bash

### User specs
## Name the job "hyak_train"
#PBS -N hyak_train
## Request 16 CPUs (cores) on 2 nodes, 32 total cores
## If the job doesn't finish in 1 day, cancel it
#PBS -l nodes=2:ppn=16,pmem=2gb,feature=16core,walltime=24:00:00
## Put the output from jobs into the below directory
#PBS -o /gscratch/csde/mienkoja/exposure_train_outn_2p16
## Put both the stderr and stdout into a single file
#PBS -k oe
#PBS -j oe
#PBS -d /home/mienkoja/pse_rodis
## Send an email when the job is aborted, begins, or terminates
#PBS -m abe

### Standard specs
HYAK_NPE=$(wc -l < "$PBS_NODEFILE")
HYAK_NNODES=$(uniq "$PBS_NODEFILE" | wc -l)
HYAK_TPN=$((HYAK_NPE/HYAK_NNODES))
NODEMEM=$(grep MemTotal /proc/meminfo | awk '{print $2}')
NODEFREE=$((NODEMEM-2097152))
MEMPERTASK=$((NODEFREE/HYAK_TPN))
ulimit -v $MEMPERTASK
export MX_RCACHE=0

### Modules
module load r_3.2.4
module load icc_14.0.3-ompi_1.8.3

### App
Rscript exposure_train.R
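
To make the "Standard specs" arithmetic concrete, here is a rough R sketch of the same calculation. The node names are hypothetical, and the 128 GB figure is the nominal node size mentioned above; the real MemTotal in /proc/meminfo will be somewhat lower, so the result is only an approximation.

```r
# Hypothetical node list mimicking $PBS_NODEFILE for nodes=2:ppn=16
# (one line per allocated core; "nodeA"/"nodeB" are made-up names).
nodefile_lines <- rep(c("nodeA", "nodeB"), each = 16)

npe    <- length(nodefile_lines)          # HYAK_NPE: total cores
nnodes <- length(unique(nodefile_lines))  # HYAK_NNODES: distinct nodes
tpn    <- npe %/% nnodes                  # HYAK_TPN: tasks per node

# Assumption: nominal 128 GiB per node, expressed in kB as in /proc/meminfo.
node_mem_kb      <- 128 * 1024^2
node_free_kb     <- node_mem_kb - 2097152  # reserve 2 GiB, as in the script
mem_per_task_kb  <- node_free_kb %/% tpn   # value passed to `ulimit -v`
mem_per_task_gib <- mem_per_task_kb / 1024^2

cat(npe, "cores on", nnodes, "nodes;",
    mem_per_task_gib, "GiB virtual memory limit per task\n")
```

Under these assumptions the `ulimit -v` value works out to a little under 8 GiB per task, which is well above the `pmem=2gb` requested from the scheduler.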

I get the following output using the script above:

Warning message:
package ‘Rmpi’ was built under R version 3.3.2 
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2 
2: package ‘foreach’ was built under R version 3.3.2 
3: package ‘iterators’ was built under R version 3.3.2 
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2 

Attaching package: ‘snow’

The following object is masked from ‘package:doMPI’:

    sinkWorkerOutput

Warning message:
package ‘snow’ was built under R version 3.3.2 
Warning message:
package ‘doSNOW’ was built under R version 3.3.2 
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   114354   6.2     350000  18.7   350000  18.7
Vcells 13378510 102.1   17463238 133.3 13379451 102.1
[n0729:14816] [[43070,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 2822635520
    32 slaves are spawned successfully. 0 failed.
  missForest iteration 1 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2 
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2 
2: package ‘foreach’ was built under R version 3.3.2 
3: package ‘iterators’ was built under R version 3.3.2 
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2 

Attaching package: ‘snow’

The following object is masked from ‘package:doMPI’:

    sinkWorkerOutput

Warning message:
package ‘snow’ was built under R version 3.3.2 
Warning message:
package ‘doSNOW’ was built under R version 3.3.2 
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   114354   6.2     350000  18.7   350000  18.7
Vcells 13378510 102.1   17463238 133.3 13379451 102.1
[n0829:32101] [[43995,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 2883256320
    32 slaves are spawned successfully. 0 failed.
  missForest iteration 1 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2 
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2 
2: package ‘foreach’ was built under R version 3.3.2 
3: package ‘iterators’ was built under R version 3.3.2 
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2 

Attaching package: ‘snow’

The following object is masked from ‘package:doMPI’:

    sinkWorkerOutput

Warning message:
package ‘snow’ was built under R version 3.3.2 
Warning message:
package ‘doSNOW’ was built under R version 3.3.2 
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   114354   6.2     350000  18.7   350000  18.7
Vcells 13378510 102.1   17463238 133.3 13379451 102.1
[n0829:32615] [[43481,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 2849570816
    32 slaves are spawned successfully. 0 failed.
  missForest iteration 1 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2 
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2 
2: package ‘foreach’ was built under R version 3.3.2 
3: package ‘iterators’ was built under R version 3.3.2 
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2 

Attaching package: ‘snow’

The following object is masked from ‘package:doMPI’:

    sinkWorkerOutput

Warning message:
package ‘snow’ was built under R version 3.3.2 
Warning message:
package ‘doSNOW’ was built under R version 3.3.2 
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   114354   6.2     350000  18.7   350000  18.7
Vcells 13378510 102.1   17463238 133.3 13379451 102.1
[n0768:11458] [[54933,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 3600089088
    32 slaves are spawned successfully. 0 failed.
  missForest iteration 1 in progress...done!
  missForest iteration 2 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2 
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2 
2: package ‘foreach’ was built under R version 3.3.2 
3: package ‘iterators’ was built under R version 3.3.2 
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2 

Attaching package: ‘snow’

The following object is masked from ‘package:doMPI’:

    sinkWorkerOutput

....

This output repeats several times, but the gist is that I get 22 successful iterations (i.e., missForest iteration 1 in progress...done!) within the day for which the job is scheduled.

The problem is that I am scheduling the job in a backfill queue, and the job gets preempted (at least) once every 4 hours. This means that I never get beyond the second iteration (i.e., missForest iteration 2 in progress...), because the job starts over every 4 hours.

Although not all iterations are created equal, I only want maxiter = 10 (see the R script above), so it seems that one day should be plenty of time for the job to finish, provided I can build some sort of checkpointing scheme (saving my progress) into the R script.
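
For illustration, here is a minimal base-R sketch of what I mean by checkpointing: save the loop state after every completed iteration, and on startup resume from the saved state if it exists. The helpers `save_checkpoint` and `run_with_checkpoints` are hypothetical names of my own, not functions from any package, and the toy `step` function merely stands in for one expensive iteration.

```r
# Hypothetical helper: write to a temporary file first and then rename it,
# so a preemption in the middle of a write cannot leave a truncated
# checkpoint in place of the previous good one.
save_checkpoint <- function(state, path) {
  tmp <- paste0(path, ".tmp")
  saveRDS(state, tmp)
  file.rename(tmp, path)
}

# Hypothetical driver: apply `step` until `max_iter` iterations are done,
# resuming from `path` if a checkpoint from an earlier run is present.
run_with_checkpoints <- function(step, init, max_iter, path) {
  state <- if (file.exists(path)) {
    readRDS(path)
  } else {
    list(iter = 0L, value = init)
  }
  while (state$iter < max_iter) {
    state$value <- step(state$value)
    state$iter  <- state$iter + 1L
    save_checkpoint(state, path)
  }
  state
}

# Toy usage: the first call simulates a run that stops after 3 iterations;
# the second call picks up at iteration 4 instead of starting over.
ckpt <- file.path(tempdir(), "toy_checkpoint.rds")
unlink(ckpt)
first   <- run_with_checkpoints(function(x) x + 1, init = 0, max_iter = 3,  path = ckpt)
resumed <- run_with_checkpoints(function(x) x + 1, init = 0, max_iter = 10, path = ckpt)
```

The catch is that this only works if I control the loop, which is not the case when the iterations happen inside a single missForest() call.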

The source for the missForest function is available in the package repo. The relevant foreach loop (based on my parameterization, parallelize = "forests") is pasted below.

foreach(xntree=idiv(ntree, chunks=getDoParWorkers()),
                              .combine='combine', .multicombine=TRUE,
                              .packages='randomForest') %dopar% {
                                randomForest(
                                  x = obsX,
                                  y = obsY,
                                  ntree = xntree,
                                  mtry = mtry,
                                  replace = replace,
                                  classwt = if (!is.null(classwt)) classwt[[varInd]] else
                                    rep(1, nlevels(obsY)),
                                  cutoff = if (!is.null(cutoff)) cutoff[[varInd]] else
                                    rep(1/nlevels(obsY), nlevels(obsY)),
                                  strata = if (!is.null(strata)) strata[[varInd]] else obsY,
                                  sampsize = if (!is.null(sampsize)) sampsize[[varInd]] else
                                    if (replace) nrow(obsX) else ceiling(0.632*nrow(obsX)),
                                  nodesize = if (!is.null(nodesize)) nodesize[2] else 5,
                                  maxnodes = if (!is.null(maxnodes)) maxnodes else NULL)
                              }

It seems like the solution would be to simply rewrite the missForest function and write the relevant information from each iteration to a suitably indexed file, so that I can pick up where I left off after the next preemption.
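
As a rough sketch of the "suitably indexed file" idea, the following base-R code writes one file per completed iteration and finds the most recent one on restart. The file-naming scheme, the helper names, and the contents of `state` (an iteration counter plus the current imputed data, here called `ximp`) are assumptions for illustration; a modified missForest would need to save whatever its loop actually carries from one iteration to the next.

```r
# Hypothetical naming scheme: one checkpoint file per completed iteration.
checkpoint_path <- function(dir, iter) {
  file.path(dir, sprintf("missforest_iter_%03d.rds", iter))
}

# Return the highest iteration index with a checkpoint on disk, or 0 if none.
latest_iteration <- function(dir) {
  pattern <- "^missforest_iter_([0-9]{3})\\.rds$"
  files <- list.files(dir, pattern = pattern)
  if (length(files) == 0L) return(0L)
  max(as.integer(sub(pattern, "\\1", files)))
}

# Toy usage: pretend two iterations finished before the job was preempted.
ckpt_dir <- file.path(tempdir(), "mf_checkpoints")
unlink(ckpt_dir, recursive = TRUE)
dir.create(ckpt_dir)
for (i in 1:2) {
  # Assumed state: iteration counter and the current imputed data.
  state <- list(iter = i, ximp = matrix(i, nrow = 2, ncol = 2))
  saveRDS(state, checkpoint_path(ckpt_dir, i))
}

# On restart: locate the last completed iteration and reload its state.
last  <- latest_iteration(ckpt_dir)
state <- if (last > 0L) readRDS(checkpoint_path(ckpt_dir, last)) else NULL
```

Keeping one file per iteration uses more disk space than overwriting a single file, but a partially written latest file then never destroys the previous good checkpoint.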

QUESTION: Are there any R packages or methodologies that simplify the implementation of a checkpointing scheme like this?
