Does sbf() use the metric argument for model optimization? Passing ROC as the metric argument value for the caretSBF function
Our objective is to use the ROC summary metric for model selection while performing feature selection with the sbf() function.
The BreastCancer dataset from the mlbench package was used as a reproducible example, running train() and sbf() with metric = "Accuracy" and with metric = "ROC".
We want to verify whether sbf() accepts the metric argument, as the train() and rfe() functions do, for model optimization. To this end, we planned to use the train() function together with the sbf() function: the caretSBF$fit function makes a call to train(), and caretSBF is passed via sbfControl.
From the output, it seems the metric argument is used only for the inner resampling and not for the sbf part; that is, for the outer resampling the metric argument was not applied as it is in train() and rfe().
Since we used caretSBF, which itself uses train(), it appears the metric argument is limited in scope to train() and is therefore not passed on to sbf.
We would appreciate clarification on whether sbf() uses the metric argument for model optimization, i.e., for the outer resampling. Below is our work on a reproducible example showing how train() uses the metric argument with both Accuracy and ROC; for sbf, however, we are not sure.
I. DATA SECTION
## Loading required packages
library(mlbench)
library(caret)
## Loading `BreastCancer` Dataset from *mlbench* package
data("BreastCancer")
## Data cleaning for missing values
# Remove rows/observation with NA Values in any of the columns
BrC1 <- BreastCancer[complete.cases(BreastCancer),]
# Removing Class and Id Column and keeping just Numeric Predictors
Num_Pred <- BrC1[,2:10]
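As a quick sanity check on the cleaned data (our own illustrative addition, not part of the original example), the snippet below confirms the sample size that appears in the train() output further down:

```r
# Sketch: verify the cleaned BreastCancer data (requires mlbench).
library(mlbench)

data("BreastCancer")
BrC1 <- BreastCancer[complete.cases(BreastCancer), ]

# 683 complete rows remain, matching the "683 samples" in the train() output
print(nrow(BrC1))
# Class distribution across the two outcome levels
print(table(BrC1$Class))
```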
II. CUSTOM SUMMARY FUNCTION SECTION
Defining the fiveStats summary function
fiveStats <- function(...) c(twoClassSummary(...),
defaultSummary(...))
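To see which statistics fiveStats reports, the sketch below (our own check with fabricated predictions, not part of the original example) calls it on a toy prediction data frame; twoClassSummary contributes ROC, Sens, and Spec, and defaultSummary contributes Accuracy and Kappa:

```r
# Sketch: inspect the statistics returned by fiveStats(), using made-up
# predictions (requires the caret package; data are random, for illustration).
library(caret)

fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

set.seed(1)
lev <- c("benign", "malignant")
d <- data.frame(obs  = factor(sample(lev, 50, replace = TRUE), levels = lev),
                pred = factor(sample(lev, 50, replace = TRUE), levels = lev))
# twoClassSummary needs class-probability columns named after the levels
d$benign    <- runif(50)
d$malignant <- 1 - d$benign

print(names(fiveStats(d, lev = lev)))
# Expected names: "ROC" "Sens" "Spec" "Accuracy" "Kappa"
```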
III. TRAIN SECTION
Defining trControl
trCtrl <- trainControl(method="repeatedcv", number=10,
repeats=1, classProbs = TRUE, summaryFunction = fiveStats)
TRAIN + METRIC = "Accuracy"
set.seed(1)
TR_acc <- train(Num_Pred,BrC1$Class, method="rf",metric="Accuracy",
trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))
TR_acc
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 615, 614, 614, 614, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9936532 0.9729798 0.9833333 0.9765772 0.9490311
# 3 0.9936544 0.9729293 0.9791667 0.9750853 0.9457534
# 4 0.9929957 0.9684343 0.9750000 0.9706948 0.9361373
# 5 0.9922907 0.9684343 0.9666667 0.9677536 0.9295782
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
TRAIN + METRIC = "ROC"
set.seed(1)
TR_roc <- train(Num_Pred,BrC1$Class, method="rf",metric="ROC",
trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))
TR_roc
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 615, 614, 614, 614, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9936532 0.9729798 0.9833333 0.9765772 0.9490311
# 3 0.9936544 0.9729293 0.9791667 0.9750853 0.9457534
# 4 0.9929957 0.9684343 0.9750000 0.9706948 0.9361373
# 5 0.9922907 0.9684343 0.9666667 0.9677536 0.9295782
#
# ROC was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 3.
IV. EDITING caretSBF
Editing the caretSBF summary function
caretSBF$summary <- fiveStats
V. SBF SECTION
Defining sbfControl
sbfCtrl <- sbfControl(functions=caretSBF,
method="repeatedcv", number=10, repeats=1,
verbose=T, saveDetails = T)
SBF + METRIC = "Accuracy"
set.seed(1)
sbf_acc <- sbf(Num_Pred, BrC1$Class,
sbfControl = sbfCtrl,
trControl = trCtrl, method="rf", metric="Accuracy")
## sbf_acc
sbf_acc
# Selection By Filter
#
# Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
#
# Resampling performance:
#
# ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
# 0.9931 0.973 0.9833 0.9766 0.949 0.006272 0.0231 0.02913 0.01226 0.02646
#
# Using the training set, 9 variables were selected:
# Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
#
# During resampling, the top 5 selected variables (out of a possible 9):
# Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
#
# On average, 9 variables were selected (min = 9, max = 9)
## Class of sbf_acc
class(sbf_acc)
# [1] "sbf"
## Names of elements of sbf_acc
names(sbf_acc)
# [1] "pred" "variables" "results" "fit" "optVariables"
# [6] "call" "control" "resample" "metrics" "times"
# [11] "resampledCM" "obsLevels" "dots"
## sbf_acc fit element
sbf_acc$fit
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 614, 614, 615, 615, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9933176 0.9706566 0.9833333 0.9751492 0.9460717
# 5 0.9920034 0.9662121 0.9791667 0.9707801 0.9363708
# 9 0.9914825 0.9684343 0.9708333 0.9693308 0.9327662
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
## Elements of sbf_acc fit
names(sbf_acc$fit)
# [1] "method" "modelInfo" "modelType" "results" "pred"
# [6] "bestTune" "call" "dots" "metric" "control"
# [11] "finalModel" "preProcess" "trainingData" "resample" "resampledCM"
# [16] "perfNames" "maximize" "yLimits" "times" "levels"
## sbf_acc fit final Model
sbf_acc$fit$finalModel
# Call:
# randomForest(x = x, y = y, mtry = param$mtry)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 2.34%
# Confusion matrix:
# benign malignant class.error
# benign 431 13 0.02927928
# malignant 3 236 0.01255230
## sbf_acc metric
sbf_acc$fit$metric
# [1] "Accuracy"
## sbf_acc fit best Tune
sbf_acc$fit$bestTune
# mtry
# 1 2
SBF + METRIC = "ROC"
set.seed(1)
sbf_roc <- sbf(Num_Pred, BrC1$Class,
sbfControl = sbfCtrl,
trControl = trCtrl, method="rf", metric="ROC")
## sbf_roc
sbf_roc
# Selection By Filter
#
# Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
#
# Resampling performance:
#
# ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
# 0.9931 0.973 0.9833 0.9766 0.949 0.006272 0.0231 0.02913 0.01226 0.02646
#
# Using the training set, 9 variables were selected:
# Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
#
# During resampling, the top 5 selected variables (out of a possible 9):
# Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
#
# On average, 9 variables were selected (min = 9, max = 9)
## Class of sbf_roc
class(sbf_roc)
# [1] "sbf"
## Names of elements of sbf_roc
names(sbf_roc)
# [1] "pred" "variables" "results" "fit" "optVariables"
# [6] "call" "control" "resample" "metrics" "times"
# [11] "resampledCM" "obsLevels" "dots"
## sbf_roc fit element
sbf_roc$fit
# Random Forest
#
# 683 samples
# 9 predictor
# 2 classes: 'benign', 'malignant'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 1 times)
# Summary of sample sizes: 615, 614, 614, 615, 615, 615, ...
# Resampling results across tuning parameters:
#
# mtry ROC Sens Spec Accuracy Kappa
# 2 0.9933176 0.9706566 0.9833333 0.9751492 0.9460717
# 5 0.9920034 0.9662121 0.9791667 0.9707801 0.9363708
# 9 0.9914825 0.9684343 0.9708333 0.9693308 0.9327662
#
# ROC was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 2.
## Elements of sbf_roc fit
names(sbf_roc$fit)
# [1] "method" "modelInfo" "modelType" "results" "pred"
# [6] "bestTune" "call" "dots" "metric" "control"
# [11] "finalModel" "preProcess" "trainingData" "resample" "resampledCM"
# [16] "perfNames" "maximize" "yLimits" "times" "levels"
## sbf_roc fit final Model
sbf_roc$fit$finalModel
# Call:
# randomForest(x = x, y = y, mtry = param$mtry)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 2.34%
# Confusion matrix:
# benign malignant class.error
# benign 431 13 0.02927928
# malignant 3 236 0.01255230
## sbf_roc metric
sbf_roc$fit$metric
# [1] "ROC"
## sbf_roc fit best Tune
sbf_roc$fit$bestTune
# mtry
# 1 2
Does sbf() use the metric argument for model optimization? If yes, what metric does sbf() use by default? If sbf() does use the metric argument, how do we set it to ROC?
Thank you.
1 Answer
sbf doesn't use the metric to optimize anything (unlike rfe); all sbf does is perform a feature-selection step before calling the model. Of course, you define the filters, but there is no way to tune the filter using sbf, so no metric is needed to guide that step.
Using sbf(x, y, metric = "ROC") will pass metric = "ROC" to whatever modeling function you are using (and it is designed to work with train when caretSBF is used). This happens because there is no metric argument to sbf:
> names(formals(caret:::sbf.default))
[1] "x" "y" "sbfControl" "..."