How do I check for overfitting when cross-validating a regression with GridSearchCV?

Question

I am fitting a regression model on a set of continuous variables and a continuous target. This is my code:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectKBest, mutual_info_regression
    from sklearn.model_selection import GridSearchCV, RepeatedKFold
    from sklearn.pipeline import Pipeline


    def run_RandomForest(xTrain, yTrain, xTest, yTest):
        # note: cv is defined here, but GridSearchCV below is given cv=5
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

        # define the pipeline to evaluate
        model = RandomForestRegressor()
        fs = SelectKBest(score_func=mutual_info_regression)
        pipeline = Pipeline(steps=[('sel', fs), ('rf', model)])

        # define the grid
        # note: this dict is never passed to GridSearchCV, so sel__k is not tuned
        grid = dict()
        grid['sel__k'] = [i for i in range(1, xTrain.shape[1] + 1)]
        search = GridSearchCV(
            pipeline,
            param_grid={
                'rf__bootstrap': [True, False],
                'rf__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
                'rf__max_features': ['auto', 'sqrt'],
                'rf__min_samples_leaf': [1, 2, 4],
                'rf__min_samples_split': [2, 5, 10],
                'rf__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
            },
            scoring='neg_mean_squared_error',
            return_train_score=True,
            verbose=1,
            cv=5,
            n_jobs=-1)

        # perform the fitting
        results = search.fit(xTrain, yTrain)

        # predict prices of X_test
        y_pred = results.predict(xTest)


    run_RandomForest(x_train, y_train, x_test, y_test)

I want to find out whether this model is overfitting. I have read that incorporating cross-validation is an effective way to check this.

As you can see, I have included cv in the code above. However, I am completely stuck on the next step. Could someone show me code that takes the cv information and produces either a plot or a set of statistics that I should analyse for overfitting? I know there are some similar questions on SO (e.g. here and here), but I don't understand how to translate either of them to my situation, because in both of those examples they just initialise a model and fit it, whereas mine incorporates GridSearchCV.

python machine-learning scikit-learn cross-validation overfitting-underfitting

user8407951, 14 Mar '21 at 19:37

1 answer

Answer 1 · user5212614, 16 Mar '21 at 05:37

You can certainly tune the hyperparameters that control the number of features randomly selected to grow each tree from the bootstrapped data. This is typically done with k-fold cross-validation: choose the value of the tuning parameter that minimizes the prediction error on the held-out folds. In addition, growing a larger forest will improve predictive accuracy, although there are usually diminishing returns once you get up to several hundred trees.
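
To see how the cross-validated error changes with forest size, one option is to loop over a few values of `n_estimators` and compare the mean error. This is only a minimal sketch: the synthetic data from `make_regression` and the small tree counts are assumptions chosen so it runs quickly, not values from the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for the real training set (an assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Cross-validated mean squared error for forests of increasing size.
cv_mse = {}
for n_trees in [10, 50, 100]:
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[n_trees] = -scores.mean()  # scores are negated MSE, so flip the sign

for n_trees, mse in cv_mse.items():
    print(f"{n_trees:>4} trees: CV MSE = {mse:.1f}")
```

If the error stops improving noticeably between two sizes, the extra trees are mostly costing fit time.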

Try this sample code.

    from pprint import pprint

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(random_state=42)

    # Look at parameters used by our current forest
    print(rf.get_params())

Output:

      {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Also...

    from pprint import pprint

    import numpy as np
    from sklearn.model_selection import RandomizedSearchCV

    # Number of trees in random forest
    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    # Number of features to consider at every split
    # ('auto' was deprecated in scikit-learn 1.1 and removed in 1.3)
    max_features = ['auto', 'sqrt']
    # Maximum number of levels in tree
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    # Minimum number of samples required to split a node
    min_samples_split = [2, 5, 10]
    # Minimum number of samples required at each leaf node
    min_samples_leaf = [1, 2, 4]
    # Method of selecting samples for training each tree
    bootstrap = [True, False]

    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}
    pprint(random_grid)

Output:

    {'bootstrap': [True, False],
     'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
     'max_features': ['auto', 'sqrt'],
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10],
     'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
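
A grid like this only becomes useful for spotting overfitting once it is passed to a search with `return_train_score=True`, so that `cv_results_` holds both training and validation scores for each candidate. The following is a rough sketch under stated assumptions: the data is synthetic (`make_regression`) and `small_grid` is a deliberately reduced, hypothetical grid so the search finishes quickly. The same `cv_results_` keys are available on `GridSearchCV`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for the real training set (an assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# A small illustrative grid, not the full grid from the answer.
small_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 10, None],
    "min_samples_leaf": [1, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    n_iter=6,
    scoring="neg_mean_squared_error",
    cv=5,
    return_train_score=True,  # required for mean_train_score in cv_results_
    random_state=0,
)
search.fit(X, y)

# Scores are negated MSE, so flip the sign to get errors.
train_mse = -search.cv_results_["mean_train_score"]
cv_mse = -search.cv_results_["mean_test_score"]

for params, tr, te in zip(search.cv_results_["params"], train_mse, cv_mse):
    print(f"train MSE = {tr:9.1f}   CV MSE = {te:9.1f}   gap = {te - tr:9.1f}   {params}")
```

A candidate whose training error is far below its cross-validated error is fitting the training folds much better than unseen data, which is the usual sign of overfitting.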

See this link for more information.

https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6

Here is sample code for cross-validation.

    # import cross_validate, the iris data, and the random forest classifier
    from sklearn.model_selection import cross_validate
    from sklearn import datasets
    from sklearn.ensemble import RandomForestClassifier

    # get iris data
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # run 5-fold cross-validation and report the per-fold and mean test scores
    model = RandomForestClassifier(random_state=1)
    cv = cross_validate(model, X, y, cv=5)
    print(cv)
    print(cv['test_score'])
    print(cv['test_score'].mean())

Output:

    {'fit_time': array([0.18350697, 0.14461398, 0.14261866, 0.13116884, 0.15478826]), 'score_time': array([0.01496148, 0.00997281, 0.00897574, 0.00797844, 0.01396227]), 'test_score': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.        ])}
    [0.96666667 0.96666667 0.93333333 0.96666667 1.        ]
    0.9666666666666668
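
For the regression case in the question, the same function can also report training scores, which is what a check for overfitting needs. Below is a minimal sketch, assuming synthetic data from `make_regression` in place of the real data and R² as the metric; neither comes from the original post.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic data stands in for the real training set (an assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=1)
cv_out = cross_validate(model, X, y, cv=5, scoring="r2", return_train_score=True)

train_r2 = cv_out["train_score"].mean()
test_r2 = cv_out["test_score"].mean()
print(f"mean train R^2: {train_r2:.3f}")
print(f"mean test  R^2: {test_r2:.3f}")
print(f"gap:            {train_r2 - test_r2:.3f}")
```

A large gap between the training and test scores suggests the model is memorizing the training folds; similar scores suggest it generalizes about as well as it fits.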

How cross-validation works internally:

1. Shuffle the dataset to remove any existing ordering.
2. Split the data into K folds. K = 5 or 10 works for most cases.
3. Hold out one fold for testing and use all the remaining folds for training.
4. Train (fit) the model on the training set, evaluate it on the test set, and record the result for that split.
5. Repeat this process for every fold, each time choosing a different fold as the test data, so that in every iteration the model is trained and tested on different sets of data.
6. At the end, sum the scores from all splits and divide by K to get the mean score.
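
The steps above can be written out by hand as a loop over `KFold` splits. This is an illustrative sketch rather than what `cross_validate` does internally: with an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling, so the scores here need not match the output shown earlier. The `n_estimators=50` setting is an arbitrary choice to keep it fast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Shuffle the data and split it into K = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the held-out fold, record the score.
    model = RandomForestClassifier(n_estimators=50, random_state=1)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average the per-fold accuracies.
mean_score = np.mean(fold_scores)
print(fold_scores)
print(mean_score)
```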