.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_kernel_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_kernel_regression.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_kernel_regression.py:


Kernel Regression
=================

In this example we compare the performance of several kernel regression
methods.

.. GENERATED FROM PYTHON SOURCE LINES 8-24

.. code-block:: Python

    # Author: Elena Petrunina
    # License: MIT
    import numpy as np
    from sklearn.metrics import r2_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    import skfda
    from skfda.misc.hat_matrix import (
        KNeighborsHatMatrix,
        LocalLinearRegressionHatMatrix,
        NadarayaWatsonHatMatrix,
    )
    from skfda.ml.regression._kernel_regression import KernelRegression

.. GENERATED FROM PYTHON SOURCE LINES 25-29

For this example, we will use the
:func:`tecator <skfda.datasets.fetch_tecator>` dataset. This dataset
contains 215 samples; each sample consists of a spectrum of absorbances
together with its water, fat and protein content.

.. GENERATED FROM PYTHON SOURCE LINES 29-35

.. code-block:: Python

    X, y = skfda.datasets.fetch_tecator(return_X_y=True, as_frame=True)
    X = X.iloc[:, 0].values
    fat = y['fat'].values

.. GENERATED FROM PYTHON SOURCE LINES 36-39

The fat percentage will be estimated from the spectrum. All the curves are
shown in the image below, colored according to their fat content, from
lowest (yellow) to highest (red).

.. GENERATED FROM PYTHON SOURCE LINES 39-42

.. code-block:: Python

    X.plot(gradient_criteria=fat, legend=True)

.. image-sg:: /auto_examples/images/sphx_glr_plot_kernel_regression_001.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/images/sphx_glr_plot_kernel_regression_001.png
   :class: sphx-glr-single-img
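All of the estimators compared below are linear smoothers: the prediction is a
weighted average of the observed responses, with weights (the rows of the hat
matrix) obtained from a kernel evaluated on the distances between samples. As a
minimal sketch of this idea — using plain NumPy on scalar inputs rather than
skfda's functional API — the Nadaraya-Watson estimator can be written as:

```python
import numpy as np

# Toy Nadaraya-Watson regression on scalar (non-functional) covariates.
# Each prediction is a kernel-weighted average of the training responses;
# the weight matrix rows are the rows of the hat matrix.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 50)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=50)


def nadaraya_watson(x_eval, x_train, y_train, bandwidth):
    # Gaussian kernel on pairwise distances between evaluation and
    # training points; rows are normalized so the weights sum to one.
    dists = x_eval[:, None] - x_train[None, :]
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ y_train


x_eval = np.array([0.25, 0.50, 0.75])
y_hat = nadaraya_watson(x_eval, x_train, y_train, bandwidth=0.05)
```

For functional data the same construction applies, with the pairwise distances
computed between whole curves (typically the L2 distance); the bandwidth plays
the same role as here, and the KNN hat matrix simply replaces the fixed
bandwidth with the distance to the k-th nearest neighbour.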
.. GENERATED FROM PYTHON SOURCE LINES 43-45

The data set is split into train and test sets, with 80% and 20% of the
samples respectively.

.. GENERATED FROM PYTHON SOURCE LINES 45-53

.. code-block:: Python

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        fat,
        test_size=0.2,
        random_state=1,
    )

.. GENERATED FROM PYTHON SOURCE LINES 54-57

The KNN hat matrix will be tried first, with the default kernel function,
i.e. the uniform kernel. To find the most suitable number of neighbours,
GridSearchCV will be used, testing every integer from 1 to 99.

.. GENERATED FROM PYTHON SOURCE LINES 57-65

.. code-block:: Python

    n_neighbors = np.array(range(1, 100))

    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={'kernel_estimator__n_neighbors': n_neighbors},
    )

.. GENERATED FROM PYTHON SOURCE LINES 66-68

The best cross-validated performance on the training set is obtained with
the following number of neighbours:

.. GENERATED FROM PYTHON SOURCE LINES 68-75

.. code-block:: Python

    knn.fit(X_train, y_train)
    print(
        'KNN n_neighbors:',
        knn.best_params_['kernel_estimator__n_neighbors'],
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    KNN n_neighbors: 3

.. GENERATED FROM PYTHON SOURCE LINES 76-78

The accuracy of the estimation on the test set, measured with r2_score,
is shown below.

.. GENERATED FROM PYTHON SOURCE LINES 78-84

.. code-block:: Python

    y_pred = knn.predict(X_test)
    knn_res = r2_score(y_pred, y_test)
    print('Score KNN:', knn_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Score KNN: 0.3500795818805428

.. GENERATED FROM PYTHON SOURCE LINES 85-87

Following a similar procedure for Nadaraya-Watson, the optimal bandwidth
is chosen from the interval [0.01, 1].

.. GENERATED FROM PYTHON SOURCE LINES 87-94

.. code-block:: Python

    bandwidth = np.logspace(-2, 0, num=100)

    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

.. GENERATED FROM PYTHON SOURCE LINES 95-96

The best performance is obtained with the following bandwidth:

.. GENERATED FROM PYTHON SOURCE LINES 96-103

.. code-block:: Python

    nw.fit(X_train, y_train)
    print(
        'Nadaraya-Watson bandwidth:',
        nw.best_params_['kernel_estimator__bandwidth'],
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Nadaraya-Watson bandwidth: 0.37649358067924693

.. GENERATED FROM PYTHON SOURCE LINES 104-106

The accuracy of the estimation is shown below; it should be similar to
that obtained with the KNN method.

.. GENERATED FROM PYTHON SOURCE LINES 106-111

.. code-block:: Python

    y_pred = nw.predict(X_test)
    nw_res = r2_score(y_pred, y_test)
    print('Score NW:', nw_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Score NW: 0.3127155617541537

.. GENERATED FROM PYTHON SOURCE LINES 112-119

For local linear regression, the functions must be represented as
FDataBasis (the previous estimators accept either FDataGrid or
FDataBasis). Here a Fourier basis with 10 elements has been selected.
Note that the number of basis functions affects the estimation result,
so ideally it should also be chosen by cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 119-138

.. code-block:: Python

    fourier = skfda.representation.basis.FourierBasis(n_basis=10)

    X_basis = X.to_basis(basis=fourier)

    X_basis_train, X_basis_test, y_train, y_test = train_test_split(
        X_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )

    bandwidth = np.logspace(0.3, 1, num=100)
    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

.. GENERATED FROM PYTHON SOURCE LINES 139-140

The bandwidth obtained by cross-validation is indicated below.

.. GENERATED FROM PYTHON SOURCE LINES 140-146

.. code-block:: Python

    llr.fit(X_basis_train, y_train)
    print(
        'LLR bandwidth:',
        llr.best_params_['kernel_estimator__bandwidth'],
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    LLR bandwidth: 4.7287621998304505

.. GENERATED FROM PYTHON SOURCE LINES 147-149

Although local linear regression is slower, in this example it should
give a noticeably better result than Nadaraya-Watson and KNN.

.. GENERATED FROM PYTHON SOURCE LINES 149-154

.. code-block:: Python

    y_pred = llr.predict(X_basis_test)
    llr_res = r2_score(y_pred, y_test)
    print('Score LLR:', llr_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Score LLR: 0.9731955244187162

.. GENERATED FROM PYTHON SOURCE LINES 155-159

For this dataset, using the derivatives of the curves should give better
performance. All the derivatives are plotted below, with the same color
scheme as before: yellow for less fat, red for more.

.. GENERATED FROM PYTHON SOURCE LINES 159-170

.. code-block:: Python

    Xd = X.derivative()
    Xd.plot(gradient_criteria=fat, legend=True)

    Xd_train, Xd_test, y_train, y_test = train_test_split(
        Xd,
        fat,
        test_size=0.2,
        random_state=1,
    )

.. image-sg:: /auto_examples/images/sphx_glr_plot_kernel_regression_002.png
   :alt: Spectrometric curves
   :srcset: /auto_examples/images/sphx_glr_plot_kernel_regression_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 171-173

Exactly the same operations are repeated, but now with the derivatives of
the functions.

.. GENERATED FROM PYTHON SOURCE LINES 175-176

K-Nearest Neighbours

.. GENERATED FROM PYTHON SOURCE LINES 176-192

.. code-block:: Python

    knn = GridSearchCV(
        KernelRegression(kernel_estimator=KNeighborsHatMatrix()),
        param_grid={'kernel_estimator__n_neighbors': n_neighbors},
    )

    knn.fit(Xd_train, y_train)
    print(
        'KNN n_neighbors:',
        knn.best_params_['kernel_estimator__n_neighbors'],
    )

    y_pred = knn.predict(Xd_test)
    dknn_res = r2_score(y_pred, y_test)
    print('Score KNN:', dknn_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    KNN n_neighbors: 4
    Score KNN: 0.9428247359478524

.. GENERATED FROM PYTHON SOURCE LINES 193-194

Nadaraya-Watson

.. GENERATED FROM PYTHON SOURCE LINES 194-210

.. code-block:: Python

    bandwidth = np.logspace(-3, -1, num=100)
    nw = GridSearchCV(
        KernelRegression(kernel_estimator=NadarayaWatsonHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

    nw.fit(Xd_train, y_train)
    print(
        'Nadaraya-Watson bandwidth:',
        nw.best_params_['kernel_estimator__bandwidth'],
    )

    y_pred = nw.predict(Xd_test)
    dnw_res = r2_score(y_pred, y_test)
    print('Score NW:', dnw_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Nadaraya-Watson bandwidth: 0.006135907273413175
    Score NW: 0.9491787548158307

.. GENERATED FROM PYTHON SOURCE LINES 211-213

For both Nadaraya-Watson and KNN the accuracy has improved significantly
and should now be above 0.9.

.. GENERATED FROM PYTHON SOURCE LINES 215-216

Local Linear Regression

.. GENERATED FROM PYTHON SOURCE LINES 216-240

.. code-block:: Python

    Xd_basis = Xd.to_basis(basis=fourier)
    Xd_basis_train, Xd_basis_test, y_train, y_test = train_test_split(
        Xd_basis,
        fat,
        test_size=0.2,
        random_state=1,
    )

    bandwidth = np.logspace(-2, 1, 100)
    llr = GridSearchCV(
        KernelRegression(kernel_estimator=LocalLinearRegressionHatMatrix()),
        param_grid={'kernel_estimator__bandwidth': bandwidth},
    )

    llr.fit(Xd_basis_train, y_train)
    print(
        'LLR bandwidth:',
        llr.best_params_['kernel_estimator__bandwidth'],
    )

    y_pred = llr.predict(Xd_basis_test)
    dllr_res = r2_score(y_pred, y_test)
    print('Score LLR:', dllr_res)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    LLR bandwidth: 0.010722672220103232
    Score LLR: 0.9949460304758446

.. GENERATED FROM PYTHON SOURCE LINES 241-243

LLR accuracy has also improved, but its advantage over Nadaraya-Watson and
KNN is less significant for the derivatives than it was for the original
curves.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 12.439 seconds)


.. _sphx_glr_download_auto_examples_plot_kernel_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/GAA-UAM/scikit-fda/develop?filepath=examples/plot_kernel_regression.py
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_kernel_regression.ipynb <plot_kernel_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_kernel_regression.py <plot_kernel_regression.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_