.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_kernel_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_kernel_regression.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_kernel_regression.py:


Kernel Regression
=================

In this example we compare the performance of several kernel regression
methods.

.. GENERATED FROM PYTHON SOURCE LINES 8-24

.. code-block:: Python

    # Author: Elena Petrunina
    # License: MIT
    import numpy as np
    from sklearn.metrics import r2_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    import skfda
    from skfda.misc.hat_matrix import (
        KNeighborsHatMatrix,
        LocalLinearRegressionHatMatrix,
        NadarayaWatsonHatMatrix,
    )
    from skfda.ml.regression._kernel_regression import KernelRegression

.. GENERATED FROM PYTHON SOURCE LINES 25-29

For this example, we will use the
:func:`tecator <skfda.datasets.fetch_tecator>` dataset. This dataset
contains 215 samples; each sample consists of a spectrum of absorbances
together with its water, fat and protein content.

.. GENERATED FROM PYTHON SOURCE LINES 29-35

.. code-block:: Python

    X, y = skfda.datasets.fetch_tecator(return_X_y=True, as_frame=True)
    X = X.iloc[:, 0].values
    fat = y['fat'].values

.. GENERATED FROM PYTHON SOURCE LINES 36-39

The fat percentage will be estimated from the spectrum. All the curves are
shown in the image below, colored according to their fat content, from
lowest (yellow) to highest (red).

.. GENERATED FROM PYTHON SOURCE LINES 39-42

.. code-block:: Python

    X.plot(gradient_criteria=fat, legend=True)

.. image-sg:: /auto_examples/images/sphx_glr_plot_kernel_regression_001.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/images/sphx_glr_plot_kernel_regression_001.png
   :class: sphx-glr-single-img
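All of the estimators compared below are linear smoothers: the prediction is a
weighted average of the observed responses, with weights (the rows of the hat
matrix) obtained from a kernel evaluated on the distances between samples. As a
minimal sketch of this idea — using plain NumPy on scalar inputs rather than
skfda's functional API — the Nadaraya-Watson estimator can be written as:

```python
import numpy as np

# Toy Nadaraya-Watson regression on scalar (non-functional) covariates.
# Each prediction is a kernel-weighted average of the training responses;
# the weight matrix rows are the rows of the hat matrix.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 50)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=50)


def nadaraya_watson(x_eval, x_train, y_train, bandwidth):
    # Gaussian kernel on pairwise distances between evaluation and
    # training points; rows are normalized so the weights sum to one.
    dists = x_eval[:, None] - x_train[None, :]
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ y_train


x_eval = np.array([0.25, 0.50, 0.75])
y_hat = nadaraya_watson(x_eval, x_train, y_train, bandwidth=0.05)
```

For functional data the same construction applies, with the pairwise distances
computed between whole curves (typically the L2 distance); the bandwidth plays
the same role as here, and the KNN hat matrix simply replaces the fixed
bandwidth with the distance to the k-th nearest neighbour.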
.. GENERATED FROM PYTHON SOURCE LINES 43-45

The data set is split into train and test sets, with 80% and 20% of the
samples respectively.

.. GENERATED FROM PYTHON SOURCE LINES 45-53

.. code-block:: Python

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        fat,
        test_size=0.2,
        random_state=1,
    )

.. GENERATED FROM PYTHON SOURCE LINES 54-57

The KNN hat matrix will be tried first, with the default kernel function,
i.e. the uniform kernel. To find the most suitable number of neighbours,
GridSearchCV will be used, testing every integer from 1 to 99.

.. GENERATED FROM PYTHON SOURCE LINES 57-65

.. code-block:: Python

    n_neighbors = np.array(range(1, 100))

    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={'kernel_estimator__n_neighbors': n_neighbors},
    )

.. GENERATED FROM PYTHON SOURCE LINES 66-68

The best cross-validated performance on the training set is obtained with
the following number of neighbours:

.. GENERATED FROM PYTHON SOURCE LINES 68-75

.. code-block:: Python

    knn.fit(X_train, y_train)
    print(
        'KNN n_neighbors:',
        knn.best_params_['kernel_estimator__n_neighbors'],
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    KNN n_neighbors: 3

.. GENERATED FROM PYTHON SOURCE LINES 76-78

The accuracy of the estimation on the test set, measured with r2_score,
is shown below.

.. GENERATED FROM PYTHON SOURCE LINES 78-84

.. code-block:: Python

    y_pred = knn.predict(X_test)
    knn_res = r2_score(y_pred, y_test)
    print('Score KNN:', knn_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Score KNN: 0.3500795818805428

.. GENERATED FROM PYTHON SOURCE LINES 85-87

Following a similar procedure for Nadaraya-Watson, the optimal bandwidth
is chosen from the interval [0.01, 1].

.. GENERATED FROM PYTHON SOURCE LINES 87-94

.. code-block:: Python

    bandwidth = np.logspace(-2, 0, num=100)

    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

.. GENERATED FROM PYTHON SOURCE LINES 95-96

The best performance is obtained with the following bandwidth:

.. GENERATED FROM PYTHON SOURCE LINES 96-103

.. code-block:: Python

    nw.fit(X_train, y_train)
    print(
        'Nadaraya-Watson bandwidth:',
        nw.best_params_['kernel_estimator__bandwidth'],
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Nadaraya-Watson bandwidth: 0.37649358067924693

.. GENERATED FROM PYTHON SOURCE LINES 104-106

The accuracy of the estimation is shown below; it should be similar to
that obtained with the KNN method.

.. GENERATED FROM PYTHON SOURCE LINES 106-111

.. code-block:: Python

    y_pred = nw.predict(X_test)
    nw_res = r2_score(y_pred, y_test)
    print('Score NW:', nw_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Score NW: 0.3127155617541537

.. GENERATED FROM PYTHON SOURCE LINES 112-119

For local linear regression, the functions must be represented as
FDataBasis (the previous estimators accept either FDataGrid or
FDataBasis). Here a Fourier basis with 10 elements has been selected.
Note that the number of basis functions affects the estimation result,
so ideally it should also be chosen by cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 119-138

.. code-block:: Python

    fourier = skfda.representation.basis.FourierBasis(n_basis=10)

    X_basis = X.to_basis(basis=fourier)

    X_basis_train, X_basis_test, y_train, y_test = train_test_split(
        X_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )

    bandwidth = np.logspace(0.3, 1, num=100)
    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

.. GENERATED FROM PYTHON SOURCE LINES 139-140

The bandwidth obtained by cross-validation is indicated below.

.. GENERATED FROM PYTHON SOURCE LINES 140-146

.. code-block:: Python

    llr.fit(X_basis_train, y_train)
    print(
        'LLR bandwidth:',
        llr.best_params_['kernel_estimator__bandwidth'],
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    LLR bandwidth: 4.7287621998304505

.. GENERATED FROM PYTHON SOURCE LINES 147-149

Although local linear regression is slower, in this example it should
give a noticeably better result than Nadaraya-Watson and KNN.

.. GENERATED FROM PYTHON SOURCE LINES 149-154

.. code-block:: Python

    y_pred = llr.predict(X_basis_test)
    llr_res = r2_score(y_pred, y_test)
    print('Score LLR:', llr_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Score LLR: 0.9731955244187162

.. GENERATED FROM PYTHON SOURCE LINES 155-159

For this dataset, using the derivatives of the curves should give better
performance. All the derivatives are plotted below, with the same color
scheme as before: yellow for less fat, red for more.

.. GENERATED FROM PYTHON SOURCE LINES 159-170

.. code-block:: Python

    Xd = X.derivative()
    Xd.plot(gradient_criteria=fat, legend=True)

    Xd_train, Xd_test, y_train, y_test = train_test_split(
        Xd,
        fat,
        test_size=0.2,
        random_state=1,
    )

.. image-sg:: /auto_examples/images/sphx_glr_plot_kernel_regression_002.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/images/sphx_glr_plot_kernel_regression_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 171-173

Exactly the same operations are repeated, but now with the derivatives of
the functions.

.. GENERATED FROM PYTHON SOURCE LINES 175-176

K-Nearest Neighbours

.. GENERATED FROM PYTHON SOURCE LINES 176-192

.. code-block:: Python

    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={'kernel_estimator__n_neighbors': n_neighbors},
    )

    knn.fit(Xd_train, y_train)
    print(
        'KNN n_neighbors:',
        knn.best_params_['kernel_estimator__n_neighbors'],
    )

    y_pred = knn.predict(Xd_test)
    dknn_res = r2_score(y_pred, y_test)
    print('Score KNN:', dknn_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    KNN n_neighbors: 4
    Score KNN: 0.9428247359478524

.. GENERATED FROM PYTHON SOURCE LINES 193-194

Nadaraya-Watson

.. GENERATED FROM PYTHON SOURCE LINES 194-210

.. code-block:: Python

    bandwidth = np.logspace(-3, -1, num=100)
    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

    nw.fit(Xd_train, y_train)
    print(
        'Nadaraya-Watson bandwidth:',
        nw.best_params_['kernel_estimator__bandwidth'],
    )

    y_pred = nw.predict(Xd_test)
    dnw_res = r2_score(y_pred, y_test)
    print('Score NW:', dnw_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Nadaraya-Watson bandwidth: 0.006135907273413175
    Score NW: 0.9491787548158307

.. GENERATED FROM PYTHON SOURCE LINES 211-213

For both Nadaraya-Watson and KNN the accuracy has improved significantly
and should now be above 0.9.

.. GENERATED FROM PYTHON SOURCE LINES 215-216

Local Linear Regression

.. GENERATED FROM PYTHON SOURCE LINES 216-240

.. code-block:: Python

    Xd_basis = Xd.to_basis(basis=fourier)
    Xd_basis_train, Xd_basis_test, y_train, y_test = train_test_split(
        Xd_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )

    bandwidth = np.logspace(-2, 1, 100)
    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

    llr.fit(Xd_basis_train, y_train)
    print(
        'LLR bandwidth:',
        llr.best_params_['kernel_estimator__bandwidth'],
    )

    y_pred = llr.predict(Xd_basis_test)
    dllr_res = r2_score(y_pred, y_test)
    print('Score LLR:', dllr_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    LLR bandwidth: 0.010722672220103232
    Score LLR: 0.9949460304758446

.. GENERATED FROM PYTHON SOURCE LINES 241-243

LLR accuracy has also improved, but its advantage over Nadaraya-Watson and
KNN is less significant for the derivatives than it was for the original
curves.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 12.439 seconds)


.. _sphx_glr_download_auto_examples_plot_kernel_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/GAA-UAM/scikit-fda/develop?filepath=examples/plot_kernel_regression.py
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_kernel_regression.ipynb <plot_kernel_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_kernel_regression.py <plot_kernel_regression.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_