Some issues related with Unified State Examination in Informatics in Russian Federation: Parameter estimation using grid search with cross-validation

A model's performance depends heavily on the values of its hyperparameters. The best values cannot be known in advance, so ideally we would try every candidate value and keep the ones that perform best. Doing this by hand takes considerable time and resources, so we use GridSearchCV to automate hyperparameter tuning.
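
To get a feel for why trying every combination by hand is impractical, the sketch below enumerates a small grid with itertools.product. The grid values mirror the SVC grid used later in this post; the helper name expand_grid and the choice of 5 folds are assumptions made for illustration and are not part of Scikit-learn's API.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of values in one {name: [values]} grid."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Two sub-grids, mirroring the SVC grid used later in this post.
grids = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

candidates = [combo for grid in grids for combo in expand_grid(grid)]
n_folds = 5  # assumed number of cross-validation folds

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # number of model fits needed
```

Even this small grid gives 12 candidates and, with 5 folds, 60 model fits, which is why the search is worth automating.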

GridSearchCV is a class in the model_selection module of Scikit-learn (also written sklearn), so the Scikit-learn library must be installed on your machine. It iterates over a predefined set of hyperparameter values and fits your estimator (model) to the training set for each combination. At the end, we can select the best parameters from the listed candidates.
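
As a minimal sketch of that workflow, the snippet below runs GridSearchCV over a tiny grid. The Iris dataset and the grid values are assumptions chosen only to keep the example short; they are not taken from the scripts in this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid: 3 values of C times 2 kernels = 6 candidates.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each candidate is scored by 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)           # best combination from the grid
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

After fitting, best_params_ holds the winning combination and best_estimator_ is a model refit on all the data passed to fit.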

This example shows how a classifier is optimized by cross-validation, which is done using the GridSearchCV object on a development set that contains only half of the available labeled data.

The performance of the selected hyperparameters and the trained model is then measured on a dedicated evaluation set that was not used during the model selection step.
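
To make the cross-validation step concrete, here is a rough sketch of how a development set can be divided into folds. It implements plain, unshuffled k-fold splitting; for classifiers GridSearchCV uses a stratified variant by default, so this is a simplification. The function name kfold_indices is hypothetical, and 898 is the size of the development half of the 1797-sample Digits dataset.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) lists for plain, unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Each sample is used for validation exactly once and for training otherwise.
for train_idx, val_idx in kfold_indices(898, 5):
    print(len(train_idx), len(val_idx))
```

Every candidate in the grid is trained and scored once per fold, and the mean of those scores decides which candidate wins. The evaluation set never enters this loop.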

(.env) [boris@Server35fedora GRIDCV]$ cat tuningGridSearchCV.py

"""
============================================================
Parameter estimation using grid search with cross-validation
============================================================

This example shows how a classifier is optimized by cross-validation,
which is done using the :class:`~sklearn.model_selection.GridSearchCV`
object on a development set that comprises only half of the available
labeled data.

The performance of the selected hyper-parameters and trained model is
then measured on a dedicated evaluation set that was not used during
the model selection step.
"""
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

scores = ["precision", "recall"]

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Run one grid search per metric, using its macro-averaged variant
    clf = GridSearchCV(SVC(), tuned_parameters, scoring="%s_macro" % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_["mean_test_score"]
    stds = clf.cv_results_["std_test_score"]
    # Report the mean score and twice its standard deviation per candidate
    for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

(.env) [boris@Server35fedora GRIDCV]$ python tuningGridSearchCV.py

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.959 (+/-0.028) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.983 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.983 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.974 (+/-0.012) for {'C': 1, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 10, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 100, 'kernel': 'linear'}
0.974 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        89
           1       0.97      1.00      0.98        90
           2       0.99      0.98      0.98        92
           3       1.00      0.99      0.99        93
           4       1.00      1.00      1.00        76
           5       0.99      0.98      0.99       108
           6       0.99      1.00      0.99        89
           7       0.99      1.00      0.99        78
           8       1.00      0.98      0.99        92
           9       0.99      0.99      0.99        92

    accuracy                           0.99       899
   macro avg       0.99      0.99      0.99       899
weighted avg       0.99      0.99      0.99       899

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.986 (+/-0.019) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.957 (+/-0.028) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.987 (+/-0.019) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.981 (+/-0.028) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.987 (+/-0.019) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.987 (+/-0.019) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.971 (+/-0.010) for {'C': 1, 'kernel': 'linear'}
0.971 (+/-0.010) for {'C': 10, 'kernel': 'linear'}
0.971 (+/-0.010) for {'C': 100, 'kernel': 'linear'}
0.971 (+/-0.010) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        89
           1       0.97      1.00      0.98        90
           2       0.99      0.98      0.98        92
           3       1.00      0.99      0.99        93
           4       1.00      1.00      1.00        76
           5       0.99      0.98      0.99       108
           6       0.99      1.00      0.99        89
           7       0.99      1.00      0.99        78
           8       1.00      0.98      0.99        92
           9       0.99      0.99      0.99        92

    accuracy                           0.99       899
   macro avg       0.99      0.99      0.99       899
weighted avg       0.99      0.99      0.99       899
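
The two searches above rank candidates by precision_macro and recall_macro, and the report includes a "macro avg" row. As a rough illustration of what macro averaging means, the sketch below computes per-class precision and recall and takes their unweighted mean. The function name and the toy labels are hypothetical; in practice Scikit-learn's own metrics should be used.

```python
def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall): unweighted means over classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        # True positives: samples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        # A class with no predictions (or no samples) contributes 0 here.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy labels, made up for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall = macro_precision_recall(y_true, y_pred)
print("%.3f %.3f" % (precision, recall))
```

Because every class counts equally, a rare class that is classified poorly lowers the macro average as much as a common one would.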

***************************************************

An example of using sklearn's GridSearchCV class to find the best parameters of an AdaBoostRegressor model for the Boston housing prices dataset (Python).

***************************************************

(.env) [boris@Server35fedora GRIDCV]$ cat AdaBoostRegressor.py

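# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this script needs scikit-learn < 1.2 to run as written.
from sklearn.datasets import load_boston
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, make_scorer, r2_score
import matplotlib.pyplot as plt
import warnings

# Silence FutureWarning messages such as the load_boston deprecation notice
warnings.simplefilter(action='ignore', category=FutureWarning)

boston = load_boston()
x, y = boston.data, boston.target

# Hold out 15% of the data for the final evaluation
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

abreg = AdaBoostRegressor()

# 2 * 4 * 3 = 24 parameter combinations
params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.05, 0.1, 0.5],
    'loss': ['linear', 'square', 'exponential'],
}

# This scorer is not passed to GridSearchCV below, so the search ranks
# candidates by the estimator's default score (R^2).
score = make_scorer(mean_squared_error)

gridsearch = GridSearchCV(abreg, params, cv=5, return_train_score=True)
gridsearch.fit(xtrain, ytrain)
print(gridsearch.best_params_)

# best_estimator_ is already refit on the whole training set (refit=True
# by default); the extra fit below only repeats that step.
best_estim = gridsearch.best_estimator_
print(best_estim)
best_estim.fit(xtrain, ytrain)

# Scores on the training set (metrics expect y_true first, then y_pred)
ytr_pred = best_estim.predict(xtrain)
mse = mean_squared_error(ytrain, ytr_pred)
r2 = r2_score(ytrain, ytr_pred)
print("MSE: %.2f" % mse)
print("R2: %.2f" % r2)

# Scores on the held-out test set
ypred = best_estim.predict(xtest)
mse = mean_squared_error(ytest, ypred)
r2 = r2_score(ytest, ypred)
print("MSE: %.2f" % mse)
print("R2: %.2f" % r2)

# Plot the true test targets against the predictions
x_ax = range(len(ytest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()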
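
Because load_boston is no longer shipped with recent Scikit-learn releases, here is a hedged variant of the same idea that should run on current versions. It swaps in the Diabetes dataset (load_diabetes), shrinks the grid so it runs quickly, fixes random_state for repeatability, and ranks candidates by negative mean squared error. All of these choices are assumptions made for illustration, not part of the script above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

# Deliberately small grid so the sketch runs quickly: 2 * 2 = 4 candidates.
params = {"n_estimators": [25, 50], "learning_rate": [0.1, 0.5]}

# Scikit-learn scorers follow "greater is better", hence the negated MSE.
search = GridSearchCV(
    AdaBoostRegressor(random_state=0),
    params,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# predict() delegates to the best estimator, refit on the whole training set.
y_pred = search.predict(X_test)
print(search.best_params_)
print("CV MSE: %.2f" % -search.best_score_)
print("Test MSE: %.2f" % mean_squared_error(y_test, y_pred))
print("Test R2: %.2f" % r2_score(y_test, y_pred))
```

With scoring set explicitly, best_score_ is the negated cross-validated MSE of the winning candidate, so flipping its sign gives an error that can be compared with the test MSE.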

REFERENCES

https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

Wednesday, May 11, 2022
