AgglomerativeClustering

class skfda.ml.clustering.AgglomerativeClustering(n_clusters=2, *, metric=LpDistance(p=2, vector_norm=None), memory=None, connectivity=None, compute_full_tree='auto', linkage, distance_threshold=None)

Agglomerative Clustering.

Recursively merges the pair of clusters that minimally increases a given linkage distance.
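The merging procedure can be sketched in a few lines of plain Python. This is a naive illustration under stated assumptions, not this class's implementation: the function naive_agglomerative is hypothetical, the observations are 1-D numbers, absolute difference stands in for the metric, and the pair with the smallest linkage distance is merged at each step.

```python
def naive_agglomerative(points, n_clusters, linkage=max):
    """Greedy bottom-up clustering of 1-D points (illustrative only).

    `linkage` reduces the pairwise distances between two clusters to a
    single number: max gives complete linkage, min gives single linkage.
    """
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    abs(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; b > a, so popping b keeps index a valid.
        clusters[a] = clusters[a] + clusters.pop(b)
    # Convert the cluster membership lists into one label per observation.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


labels = naive_agglomerative([0.0, 0.5, 1.0, 10.0, 10.4, 11.0], n_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Passing linkage=min to the same loop performs single linkage instead of complete linkage.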

Notes

This class extends sklearn.cluster.AgglomerativeClustering to accept functional data objects and metrics. See also the documentation of the original class.

Parameters:
  • n_clusters (int | None) – The number of clusters to find. It must be None if distance_threshold is not None.

  • metric (MetricOrPrecomputed[MetricElementType]) – Metric used to compute the linkage. If it is skfda.misc.metrics.PRECOMPUTED or the string "precomputed", a distance matrix (instead of a similarity matrix) is needed as input for the fit method.

  • memory (str | joblib.Memory | None) – Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

  • connectivity (Connectivity[MetricElementType]) – Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as one derived from kneighbors_graph. Default is None, i.e., the hierarchical clustering algorithm is unstructured.

  • compute_full_tree (Literal['auto'] | bool) – Stop the construction of the tree early at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default, compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.

  • linkage (LinkageCriterionLike) –

    Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

    • average uses the average of the distances of each observation of the two sets.

    • complete or maximum linkage uses the maximum of the distances between all observations of the two sets.

    • single uses the minimum of the distances between all observations of the two sets.

  • distance_threshold (float | None) – The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.
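The three linkage criteria listed for the linkage parameter can be compared on a pair of small sets. The following is a minimal sketch rather than the library's code: the helper linkage_distance and the toy 1-D observations are hypothetical, and absolute difference stands in for the metric argument.

```python
from itertools import product
from statistics import mean


def linkage_distance(set_a, set_b, criterion, dist=lambda a, b: abs(a - b)):
    """Distance between two sets of observations under a linkage criterion.

    Hypothetical helper for illustration only; `dist` stands in for the
    metric passed to the estimator.
    """
    # All distances between one observation of each set.
    pairwise = [dist(a, b) for a, b in product(set_a, set_b)]
    if criterion == "average":
        return mean(pairwise)
    if criterion == "complete":
        return max(pairwise)
    if criterion == "single":
        return min(pairwise)
    raise ValueError(f"unknown criterion: {criterion!r}")


cluster_a = [0.0, 1.0]
cluster_b = [4.0, 6.0]
# Pairwise distances between the two sets: 4, 6, 3 and 5.
results = {
    name: linkage_distance(cluster_a, cluster_b, name)
    for name in ("average", "complete", "single")
}
print(results)  # {'average': 4.5, 'complete': 6.0, 'single': 3.0}
```

Complete linkage is the most conservative of the three, since a single distant pair of observations is enough to keep two sets apart.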

Attributes:
  • n_clusters_ – The number of clusters found by the algorithm. If distance_threshold=None, it will be equal to the given n_clusters.

  • labels_ – Cluster labels for each point.

  • n_leaves_ – Number of leaves in the hierarchical tree.

  • n_connected_components_ – The estimated number of connected components in the graph.

  • children_ – The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree, which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively, at the i-th iteration, children_[i][0] and children_[i][1] are merged to form node n_samples + i.
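To make the children_ indexing concrete, the sketch below walks a hand-written merge history and lists the original samples under each merged node. The array values and the helper leaves_under are hypothetical, chosen only to illustrate the node numbering described above. They are not output from a fitted estimator.

```python
def leaves_under(node, children, n_samples):
    """Return the sorted original sample indices contained in `node`."""
    if node < n_samples:
        # Values below n_samples are leaves, i.e. original samples.
        return [node]
    # A non-leaf node i stores its two children at row i - n_samples.
    left, right = children[node - n_samples]
    return sorted(
        leaves_under(left, children, n_samples)
        + leaves_under(right, children, n_samples)
    )


n_samples = 4
# Hypothetical merge history, in the layout of children_:
# iteration 0 merges samples 0 and 1 into node 4,
# iteration 1 merges samples 2 and 3 into node 5,
# iteration 2 merges nodes 4 and 5 into node 6 (the root).
children = [(0, 1), (2, 3), (4, 5)]

for i in range(len(children)):
    node = n_samples + i
    print(node, leaves_under(node, children, n_samples))
# 4 [0, 1]
# 5 [2, 3]
# 6 [0, 1, 2, 3]
```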

Examples

>>> from skfda import FDataGrid
>>> from skfda.ml.clustering import AgglomerativeClustering
>>> import numpy as np
>>> data_matrix = np.array([[1, 2], [1, 4], [1, 0],
...                        [4, 2], [4, 4], [4, 0]])
>>> X = FDataGrid(data_matrix)
>>> clustering = AgglomerativeClustering(
...     linkage=AgglomerativeClustering.LinkageCriterion.COMPLETE,
... )
>>> clustering.fit(X)
AgglomerativeClustering(...)
>>> clustering.labels_.astype(np.int_)
array([0, 0, 1, 0, 0, 1])

Methods

fit(X[, y])

fit_predict(X[, y])

Perform clustering on X and return cluster labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None)
Parameters:
  • X (MetricElementType) –

  • y (None) –

Return type:

AgglomerativeClustering[MetricElementType]

fit_predict(X, y=None)

Perform clustering on X and return cluster labels.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data.

  • y (Ignored) – Not used, present for API consistency by convention.

  • **kwargs (dict) –

    Arguments to be passed to fit.

    New in version 1.4.

Returns:

labels – Cluster labels.

Return type:

ndarray of shape (n_samples,), dtype=np.int64

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
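The <component>__<parameter> naming used by set_params can be illustrated with a toy helper that groups keys by component. This is a sketch of the naming convention only, not the estimator's code: the function split_nested_params and the component name "clustering" are hypothetical.

```python
def split_nested_params(params):
    """Group parameters by component using the `__` separator.

    Hypothetical helper: keys without `__` belong to the estimator itself
    (stored under the empty string), the rest to the named component.
    """
    grouped = {}
    for key, value in params.items():
        # Split on the first `__` only; the remainder may be nested further.
        name, sep, sub_key = key.partition("__")
        if sep:
            grouped.setdefault(name, {})[sub_key] = value
        else:
            grouped.setdefault("", {})[name] = value
    return grouped


grouped = split_nested_params(
    {"memory": None, "clustering__n_clusters": 3, "clustering__linkage": "single"}
)
print(grouped)
# {'': {'memory': None}, 'clustering': {'n_clusters': 3, 'linkage': 'single'}}
```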