R package or technique for saving progress in parallel computing jobs
I am trying to schedule the following R script (exposure_train.R):
.libPaths("/gscratch/csde/mienkoja/rpackages")
library(Rmpi)
library(missForest)
library(snow)
library(doSNOW)
# load the data prepared above
load("/home/mienkoja/pse_rodis/dat_clean.rds")
# get rid of garbage from memory
gc()
# read node information from the system environment
nodefile <- Sys.getenv("PBS_NODEFILE")
# assign node information to a nodes object
nodes <- readLines(nodefile)
# create a cluster
cl <- makeMPIcluster(length(nodes), includemaster=TRUE)
# register the cluster
registerDoSNOW(cl)
rf_random <- missForest(dat_clean, maxiter = 10, ntree = 100, parallelize = "forests")
stopCluster(cl)
save(rf_random, file = "/home/mienkoja/pse_rodis/rf_random_out.rds")
I am working on a cluster with several hundred nodes, each of which typically has 16 CPU cores and 128 GB of memory. All nodes run CentOS 6 Linux and are tied together by the Moab cluster software. I schedule the job with the TORQUE scheduler, using the following Bash file:
#!/bin/bash
### User specs
## Name the job "hyak_train"
#PBS -N hyak_train
## Request 16 CPUs (cores) on 2 nodes, 32 total cores
## If the job doesn't finish in 1 day, cancel it
#PBS -l nodes=2:ppn=16,pmem=2gb,feature=16core,walltime=24:00:00
## Put the output from jobs into the below directory
#PBS -o /gscratch/csde/mienkoja/exposure_train_outn_2p16
## Put both the stderr and stdout into a single file
#PBS -k oe
#PBS -j oe
#PBS -d /home/mienkoja/pse_rodis
## Send an email when the job is aborted, begins, or terminates
#PBS -m abe
### Standard specs
HYAK_NPE=$(wc -l < $PBS_NODEFILE)
HYAK_NNODES=$(uniq $PBS_NODEFILE | wc -l )
HYAK_TPN=$((HYAK_NPE/HYAK_NNODES))
NODEMEM=`grep MemTotal /proc/meminfo | awk '{print $2}'`
NODEFREE=$((NODEMEM-2097152))
MEMPERTASK=$((NODEFREE/HYAK_TPN))
ulimit -v $MEMPERTASK
export MX_RCACHE=0
### Modules
module load r_3.2.4
module load icc_14.0.3-ompi_1.8.3
### App
Rscript exposure_train.R
I get the following output using the script above:
Warning message:
package ‘Rmpi’ was built under R version 3.3.2
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2
2: package ‘foreach’ was built under R version 3.3.2
3: package ‘iterators’ was built under R version 3.3.2
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2
Attaching package: ‘snow’
The following object is masked from ‘package:doMPI’:
sinkWorkerOutput
Warning message:
package ‘snow’ was built under R version 3.3.2
Warning message:
package ‘doSNOW’ was built under R version 3.3.2
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 114354 6.2 350000 18.7 350000 18.7
Vcells 13378510 102.1 17463238 133.3 13379451 102.1
[n0729:14816] [[43070,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 2822635520
32 slaves are spawned successfully. 0 failed.
missForest iteration 1 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2
2: package ‘foreach’ was built under R version 3.3.2
3: package ‘iterators’ was built under R version 3.3.2
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2
Attaching package: ‘snow’
The following object is masked from ‘package:doMPI’:
sinkWorkerOutput
Warning message:
package ‘snow’ was built under R version 3.3.2
Warning message:
package ‘doSNOW’ was built under R version 3.3.2
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 114354 6.2 350000 18.7 350000 18.7
Vcells 13378510 102.1 17463238 133.3 13379451 102.1
[n0829:32101] [[43995,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 2883256320
32 slaves are spawned successfully. 0 failed.
missForest iteration 1 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2
2: package ‘foreach’ was built under R version 3.3.2
3: package ‘iterators’ was built under R version 3.3.2
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2
Attaching package: ‘snow’
The following object is masked from ‘package:doMPI’:
sinkWorkerOutput
Warning message:
package ‘snow’ was built under R version 3.3.2
Warning message:
package ‘doSNOW’ was built under R version 3.3.2
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 114354 6.2 350000 18.7 350000 18.7
Vcells 13378510 102.1 17463238 133.3 13379451 102.1
[n0829:32615] [[43481,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 2849570816
32 slaves are spawned successfully. 0 failed.
missForest iteration 1 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2
2: package ‘foreach’ was built under R version 3.3.2
3: package ‘iterators’ was built under R version 3.3.2
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2
Attaching package: ‘snow’
The following object is masked from ‘package:doMPI’:
sinkWorkerOutput
Warning message:
package ‘snow’ was built under R version 3.3.2
Warning message:
package ‘doSNOW’ was built under R version 3.3.2
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 114354 6.2 350000 18.7 350000 18.7
Vcells 13378510 102.1 17463238 133.3 13379451 102.1
[n0768:11458] [[54933,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 17 --singleton-died-pipe 18 -mca state_novm_select 1 -mca ess_base_jobid 3600089088
32 slaves are spawned successfully. 0 failed.
missForest iteration 1 in progress...done!
missForest iteration 2 in progress...Warning message:
package ‘Rmpi’ was built under R version 3.3.2
Loading required package: foreach
Loading required package: iterators
Warning messages:
1: package ‘doMPI’ was built under R version 3.3.2
2: package ‘foreach’ was built under R version 3.3.2
3: package ‘iterators’ was built under R version 3.3.2
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Warning message:
package ‘randomForest’ was built under R version 3.3.2
Attaching package: ‘snow’
The following object is masked from ‘package:doMPI’:
sinkWorkerOutput
....
This output repeats several times, but the gist is that I get 22 successful iterations (i.e. missForest iteration 1 in progress...done!) over the course of the day for which the job is scheduled.
The problem is that I am scheduling the job in a backfill queue, and the job is preempted (at least) once every 4 hours. This means I never get past the second iteration (i.e. missForest iteration 2 in progress...) because the job starts over every 4 hours.
Not all iterations are created equal, but since I only want maxiter = 10 (see the R script above), one day seems like enough time to run the job, provided I can build some sort of checkpointing scheme (saving my progress) into the R script.
The source for the missForest function is available in the package repo. The relevant foreach loop (based on my parameterization, parallelize = "forests") is pasted below.
foreach(xntree=idiv(ntree, chunks=getDoParWorkers()),
        .combine='combine', .multicombine=TRUE,
        .packages='randomForest') %dopar% {
  randomForest(
    x = obsX,
    y = obsY,
    ntree = xntree,
    mtry = mtry,
    replace = replace,
    classwt = if (!is.null(classwt)) classwt[[varInd]] else
      rep(1, nlevels(obsY)),
    cutoff = if (!is.null(cutoff)) cutoff[[varInd]] else
      rep(1/nlevels(obsY), nlevels(obsY)),
    strata = if (!is.null(strata)) strata[[varInd]] else obsY,
    sampsize = if (!is.null(sampsize)) sampsize[[varInd]] else
      if (replace) nrow(obsX) else ceiling(0.632*nrow(obsX)),
    nodesize = if (!is.null(nodesize)) nodesize[2] else 5,
    maxnodes = if (!is.null(maxnodes)) maxnodes else NULL)
}
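To make the work split in that loop concrete: idiv(ntree, chunks = ...) hands each worker a share of the trees. The base-R sketch below only mimics that split; split_trees is a hypothetical helper, not the iterators implementation, and the worker count of 32 is an assumption taken from the "32 slaves are spawned" line in the log rather than from getDoParWorkers() itself.

```r
# Hypothetical helper: approximate how idiv(n, chunks = k) divides n items
# into k nearly equal pieces. A base-R stand-in, not the iterators source.
split_trees <- function(ntree, workers) {
  sizes <- numeric(0)
  remaining <- ntree
  for (left in seq(workers, 1)) {
    if (remaining <= 0) break
    take <- ceiling(remaining / left)  # fair share of what is still unassigned
    sizes <- c(sizes, take)
    remaining <- remaining - take
  }
  sizes
}

# ntree = 100 comes from the script above; 32 workers is an assumption.
chunks <- split_trees(100, 32)
print(table(chunks))
```

Under these assumptions every worker grows only three or four trees per foreach call, so a single call is a small unit of work compared with a full missForest iteration.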
It seems that the solution is simply to rewrite the missForest function and write the relevant information from each iteration to an appropriately indexed file, so that I can pick up where I left off after the next preemption.
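The kind of scheme I have in mind could be sketched roughly as follows. This is a minimal base-R illustration, not missForest code: save_checkpoint, load_checkpoint, and run_with_checkpoints are hypothetical names, the step function stands in for one imputation iteration, and the file layout is an assumption. It also does not capture the random number generator state.

```r
# Hypothetical checkpoint helpers; names and file layout are assumptions.
save_checkpoint <- function(state, path) {
  tmp <- paste0(path, ".tmp")
  saveRDS(state, tmp)      # write to a temporary file first
  file.rename(tmp, path)   # then swap it in (on Linux this replaces the old
                           # file), so a kill mid-write leaves the previous
                           # checkpoint intact
}

load_checkpoint <- function(path, default) {
  if (file.exists(path)) readRDS(path) else default
}

run_with_checkpoints <- function(step, init, maxiter, path) {
  # Resume from the saved state if one exists, otherwise start fresh.
  state <- load_checkpoint(path, list(iter = 0, value = init))
  while (state$iter < maxiter) {
    state$value <- step(state$value)  # one unit of work, e.g. one iteration
    state$iter <- state$iter + 1
    save_checkpoint(state, path)      # persist progress after every iteration
  }
  state
}

# Toy usage: the "work" is just adding 1 to a counter.
ckpt <- file.path(tempdir(), "toy_checkpoint.rds")
unlink(ckpt)
first <- run_with_checkpoints(function(x) x + 1, init = 0, maxiter = 3, path = ckpt)
# A later run pointed at the same file resumes at iteration 3, not 0.
resumed <- run_with_checkpoints(function(x) x + 1, init = 0, maxiter = 10, path = ckpt)
```

Applying this to missForest would still require exposing its per-iteration state (the current imputed data and the iteration counter), which is the rewrite described above.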
QUESTION: Are there any R packages or techniques that make it easier to implement a checkpointing scheme like this?