KMeans#

class skfda.ml.clustering.KMeans(*, n_clusters=2, init=None, metric=LpDistance(p=2, vector_norm=None), n_init=1, max_iter=100, tol=0.0001, random_state=0)[source]#

K-Means algorithm for functional data.

Let \(\mathbf{X = \left\{ x_{1}, x_{2}, ..., x_{n}\right\}}\) be the dataset to be analyzed, and let \(\mathbf{V = \left\{ v_{1}, v_{2}, ..., v_{c}\right\}}\) be the set of cluster centers of \(\mathbf{X}\) in \(m\)-dimensional space \(\left( \mathbb{R}^m \right)\), where \(n\) is the number of objects, \(m\) is the number of features, and \(c\) is the number of partitions or clusters.

KM iteratively computes cluster centroids in order to minimize the sum of squared distances under the chosen metric. The KM algorithm aims at minimizing an objective function known as the squared error function, given as follows:

\[J_{KM}\left(\mathbf{X}; \mathbf{V}\right) = \sum_{i=1}^{c} \sum_{j=1}^{n_{i}}D_{ij}^2\]

Here \(D_{ij}^2\) is the squared distance under the chosen measure, which can be any p-norm: \(D_{ij} = \lVert x_{ij} - v_{i} \rVert = \left( \int_I \lvert x_{ij} - v_{i}\rvert^p dx \right)^{ \frac{1}{p}}\), where \(I\) is the domain on which \(\mathbf{X}\) is defined, \(1 \leqslant i \leqslant c\), \(1 \leqslant j\leqslant n_{i}\), and \(n_{i}\) is the number of data points in the i-th cluster.
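As a rough numerical illustration of the objective above, the following sketch approximates the integral with the trapezoidal rule for curves sampled on a common grid. The helper names `trapezoid`, `lp_distance` and `objective` are hypothetical and are not part of skfda, whose own integration scheme may differ.

```python
import numpy as np


def trapezoid(values, grid):
    """Integrate sampled values over the grid with the trapezoidal rule."""
    widths = np.diff(grid)
    return np.sum((values[..., 1:] + values[..., :-1]) * widths / 2, axis=-1)


def lp_distance(x, v, grid, p=2):
    """Approximate (integral over I of |x - v|^p)^(1/p) for sampled curves."""
    return trapezoid(np.abs(x - v) ** p, grid) ** (1 / p)


def objective(curves, centers, labels, grid, p=2):
    """Sum of squared distances from each curve to its assigned center."""
    distances = lp_distance(curves, centers[labels], grid, p)
    return float(np.sum(distances ** 2))


grid = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
curves = np.array([
    [1, 1, 2, 3, 2.5, 2],
    [0.5, 0.5, 1, 2, 1.5, 1],
    [-1, -1, -0.5, 1, 1, 0.5],
    [-0.5, -0.5, -0.5, -1, -1, -1],
])

# An arbitrary assignment of the four curves to two clusters.
labels = np.array([0, 0, 0, 1])
centers = np.array([curves[labels == k].mean(axis=0) for k in range(2)])

j_km = objective(curves, centers, labels, grid)
print(j_km)
```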

For \(c\) clusters, KM is based on an iterative algorithm minimizing the sum of distances from each observation to its cluster centroid. The observations are moved between clusters until the sum cannot be decreased any more. KM algorithm involves the following steps:

  1. Centroids of \(c\) clusters are chosen from \(\mathbf{X}\) randomly or are passed to the function as a parameter.

  2. Distances between data points and cluster centroids are calculated.

  3. Each data point is assigned to the cluster whose centroid is closest to it.

  4. Cluster centroids are updated by using the following formula: \(\mathbf{v_{i}} ={\sum_{j=1}^{n_{i}}x_{ij}}/n_{i}\), \(1 \leqslant i \leqslant c\).

  5. Distances from the updated cluster centroids are recalculated.

  6. If no data point is assigned to a new cluster, the run of the algorithm is stopped; otherwise steps 3 to 5 are repeated to allow data points to move between clusters.
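The steps listed above can be sketched in plain NumPy for curves sampled on a common grid. This is a minimal illustration under stated assumptions, not skfda's implementation: the functions `squared_l2` and `kmeans_sketch` are hypothetical names, the L2 distance is approximated with the trapezoidal rule, and the initial centers are passed explicitly rather than drawn at random.

```python
import numpy as np


def squared_l2(curves, centers, grid):
    """Squared L2 distances, shape (n_curves, n_centers), by the trapezoidal rule."""
    diff2 = (curves[:, None, :] - centers[None, :, :]) ** 2
    return np.sum((diff2[..., 1:] + diff2[..., :-1]) * np.diff(grid) / 2, axis=-1)


def kmeans_sketch(curves, init_centers, grid, max_iter=100):
    centers = init_centers.copy()                        # step 1: initial centroids
    labels = None
    for _ in range(max_iter):
        distances = squared_l2(curves, centers, grid)    # steps 2 and 5: distances
        new_labels = distances.argmin(axis=1)            # step 3: closest centroid
        if labels is not None and np.array_equal(new_labels, labels):
            break                                        # step 6: nothing moved
        labels = new_labels
        for k in range(len(centers)):                    # step 4: mean of members
            members = curves[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels, centers


grid = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
curves = np.array([
    [1, 1, 2, 3, 2.5, 2],
    [0.5, 0.5, 1, 2, 1.5, 1],
    [-1, -1, -0.5, 1, 1, 0.5],
    [-0.5, -0.5, -0.5, -1, -1, -1],
])

# Start from the first and last curves as initial centroids.
labels, centers = kmeans_sketch(curves, curves[[0, 3]], grid)
print(labels)
print(centers)
```

With these starting centers the sketch groups the first two curves together and the last two together. A different initialization can converge to a different partition, which is why the result of k-means depends on how the centroids are seeded.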

This algorithm is applied for each dimension on the image of the FDataGrid object.

Parameters:
  • n_clusters (int) – Number of groups into which the samples are classified. Defaults to 2.

  • init (Input | None) – Contains the initial centers of the different clusters the algorithm starts with. Its data_matrix must be of the shape (n_clusters, fdatagrid.ncol, fdatagrid.dim_codomain). Defaults to None, in which case the centers are initialized randomly.

  • metric (Metric[Input]) – Functional data metric used to compute distances. Defaults to the L2 distance, LpDistance(p=2).

  • n_init (int) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

  • max_iter (int) – Maximum number of iterations of the clustering algorithm for a single run. Defaults to 100.

  • tol (float) – Tolerance used to compare the centroids calculated with the previous ones in every single run of the algorithm.

  • random_state (RandomStateLike) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. Defaults to 0. See Glossary.

Attributes:
  • labels_ – Vector in which each entry contains the cluster each observation belongs to.

  • cluster_centers_ – Centroids of each cluster, whose data_matrix has shape (n_clusters, ncol, dim_codomain).

  • inertia_ – Sum of squared distances of samples to their closest cluster center for each dimension.

  • n_iter_ – Number of iterations the algorithm was run for each dimension.

Example

>>> import skfda
>>> data_matrix = [[1, 1, 2, 3, 2.5, 2],
...                [0.5, 0.5, 1, 2, 1.5, 1],
...                [-1, -1, -0.5, 1, 1, 0.5],
...                [-0.5, -0.5, -0.5, -1, -1, -1]]
>>> grid_points = [0, 2, 4, 6, 8, 10]
>>> fd = skfda.FDataGrid(data_matrix, grid_points)
>>> kmeans = skfda.ml.clustering.KMeans(random_state=0)
>>> kmeans.fit(fd)
KMeans(...)
>>> kmeans.cluster_centers_.data_matrix
array([[[ 0.16666667],
        [ 0.16666667],
        [ 0.83333333],
        [ 2.        ],
        [ 1.66666667],
        [ 1.16666667]],
       [[-0.5       ],
        [-0.5       ],
        [-0.5       ],
        [-1.        ],
        [-1.        ],
        [-1.        ]]])

Methods

fit(X[, y, sample_weight])

Fit the model.

fit_predict(X[, y])

Perform clustering on X and return cluster labels.

fit_transform(X[, y, sample_weight])

Compute clustering and transform X to cluster-distance space.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict(X[, sample_weight])

Predict the closest cluster each sample in X belongs to.

score(X[, y, sample_weight])

Opposite of the value of X on the K-means objective.

set_fit_request(*[, sample_weight])

Request metadata passed to the fit method.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

set_predict_request(*[, sample_weight])

Request metadata passed to the predict method.

set_score_request(*[, sample_weight])

Request metadata passed to the score method.

transform(X)

Transform X to a cluster-distance space.

fit(X, y=None, sample_weight=None)[source]#

Fit the model.

Parameters:
  • X (Input) – Object whose samples are clustered, that is, classified into different groups.

  • y (object) – present here for API consistency by convention.

  • sample_weight (None) – present here for API consistency by convention.

  • self (SelfType) –

Returns:

Fitted model.

Return type:

SelfType

fit_predict(X, y=None)[source]#

Perform clustering on X and return cluster labels.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data.

  • y (Ignored) – Not used, present for API consistency by convention.

  • **kwargs (dict) –

    Arguments to be passed to fit.

    New in version 1.4.

Returns:

labels – Cluster labels.

Return type:

ndarray of shape (n_samples,), dtype=np.int64

fit_transform(X, y=None, sample_weight=None)[source]#

Compute clustering and transform X to cluster-distance space.

Parameters:
  • X (Input) – Object whose samples are classified into different groups.

  • y (object) – present here for API consistency by convention.

  • sample_weight (None) – present here for API consistency by convention.

Returns:

Distances of each sample to each cluster.

Return type:

ndarray[Any, dtype[float64]]

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X, sample_weight=None)[source]#

Predict the closest cluster each sample in X belongs to.

Parameters:
  • X (Input) – Object whose samples are classified into different groups.

  • sample_weight (None) – present here for API consistency by convention.

Returns:

Label of each sample.

Return type:

ndarray[Any, dtype[int64]]

score(X, y=None, sample_weight=None)[source]#

Opposite of the value of X on the K-means objective.

Parameters:
  • X (Input) – Object whose samples are classified into different groups.

  • y (object) – present here for API consistency by convention.

  • sample_weight (None) – present here for API consistency by convention.

Returns:

Negative inertia_ attribute.

Return type:

float

set_fit_request(*, sample_weight='$UNCHANGED$')#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

  • self (KMeans) –

Returns:

self – The updated object.

Return type:

object

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

New in version 1.4: “polars” option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_predict_request(*, sample_weight='$UNCHANGED$')#

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in predict.

  • self (KMeans) –

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight='$UNCHANGED$')#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (KMeans) –

Returns:

self – The updated object.

Return type:

object

transform(X)[source]#

Transform X to a cluster-distance space.

Parameters:

X (Input) – Object whose samples are classified into different groups.

Returns:

distances of each sample to each cluster.

Return type:

distances_to_centers

Examples using skfda.ml.clustering.KMeans#

Clustering

Meteorological data: data visualization, clustering, and functional PCA

Scikit-fda and scikit-learn
