MSPlotOutlierDetector#

class skfda.exploratory.outliers.MSPlotOutlierDetector(*, multivariate_depth=None, pointwise_weights=None, assume_centered=False, support_fraction=None, num_resamples=1000, random_state=0, cutoff_factor=1, _force_asymptotic=False)[source]#

Outlier detector using directional outlyingness.

Considering \(\mathbf{Y} = \left(\mathbf{MO}^T, VO\right)^T\), the outlier detection method is implemented as described below.

First, the square robust Mahalanobis distance is calculated based on a sample of size \(h \leq fdatagrid.n_samples\):

\[{RMD}^2\left( \mathbf{Y}, \mathbf{\tilde{Y}}^*_J\right) = \left( \mathbf{Y} - \mathbf{\tilde{Y}}^*_J\right)^T {\mathbf{S}^*_J}^{-1} \left( \mathbf{Y} - \mathbf{\tilde{Y}}^*_J\right)\]

where \(J\) denotes the group of \(h\) samples that minimizes the determinant of the corresponding covariance matrix, \(\mathbf{\tilde{Y}}^*_J = h^{-1}\sum_{i\in{J}}\mathbf{Y}_i\) and \(\mathbf{S}^*_J = h^{-1}\sum_{i\in{J}}\left( \mathbf{Y}_i - \mathbf{ \tilde{Y}}^*_J\right) \left( \mathbf{Y}_i - \mathbf{\tilde{Y}}^*_J \right)^T\). The sub-sample of size h controls the robustness of the method.
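As a sketch of this first step, the squared robust Mahalanobis distances can be computed with scikit-learn's FastMCD-based `MinCovDet` estimator, whose `assume_centered` and `support_fraction` parameters mirror the ones documented below. The sample `Y` here is a synthetic stand-in for \(\left(\mathbf{MO}^T, VO\right)^T\), not skfda internals:

```python
import numpy as np
from sklearn.covariance import MinCovDet

# Synthetic stand-in for Y = (MO^T, VO)^T with p + 1 = 3 components.
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 3))

# support_fraction plays the role of h / n_samples in the formula above.
mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(Y)

# Squared robust Mahalanobis distance RMD^2 of each sample to the
# MCD location, using the MCD covariance estimate.
rmd2 = mcd.mahalanobis(Y)
print(rmd2.shape)  # (100,)
```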

Then, the tail of this distance distribution is approximated as follows:

\[\frac{c\left(m - p\right)}{m\left(p + 1\right)}RMD^2\left( \mathbf{Y}, \mathbf{\tilde{Y}}^*_J\right)\sim F_{p+1, m-p}\]

where \(p\) is the dimension of the image plus one, and \(c\) and \(m\) are parameters determining the degrees of freedom of the \(F\)-distribution and the scaling factor, given by empirical results and an asymptotic formula.

Finally, we choose a cutoff value \(C\) to determine the outliers as the \(\alpha\) quantile of \(F_{p+1, m-p}\). We set \(\alpha = 0.993\), the value used in the classical boxplot for detecting outliers under a normal distribution.
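The cutoff can be sketched by inverting the \(F\) approximation above with `scipy.stats`. The values of `m` and `c` below are illustrative placeholders only; their actual values come from the empirical results and asymptotic formula mentioned above:

```python
import scipy.stats

p = 2          # dimension of the image plus one (univariate curves)
m = 50.0       # illustrative scaling parameter (placeholder value)
c = 1.0        # illustrative correction factor (placeholder value)
alpha = 0.993  # quantile used by the classical boxplot rule

# alpha quantile of F_{p+1, m-p} ...
f_quantile = scipy.stats.f.ppf(alpha, p + 1, m - p)

# ... rescaled back to the RMD^2 scale: samples with RMD^2 above this
# cutoff are flagged as outliers.
cutoff = f_quantile * m * (p + 1) / (c * (m - p))
print(cutoff)
```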

Parameters:
  • multivariate_depth (Depth[NDArrayFloat] | None) – Method used to order the data. Defaults to projection depth.

  • pointwise_weights (NDArrayFloat | None) – An array containing the weights of each point of discretisation where values have been recorded.

  • cutoff_factor (float) – Factor that multiplies the cutoff value, in order to consider more or less curves as outliers.

  • assume_centered (bool) – If True, the support of the robust location and covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful for data whose mean is close to zero but not exactly zero. If False (the default), the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.

  • support_fraction (float | None) – The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: (n_samples + n_features + 1) / 2.

  • random_state (RandomStateLike) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. By default, it is 0.

  • num_resamples (int) –

  • _force_asymptotic (bool) –

Example

Function \(f : \mathbb{R}\longrightarrow\mathbb{R}\).

>>> import skfda
>>> from skfda.exploratory.outliers import MSPlotOutlierDetector
>>> data_matrix = [[1, 1, 2, 3, 2.5, 2],
...                [0.5, 0.5, 1, 2, 1.5, 1],
...                [-1, -1, -0.5, 1, 1, 0.5],
...                [-0.5, -0.5, -0.5, -1, -1, -1]]
>>> grid_points = [0, 2, 4, 6, 8, 10]
>>> fd = skfda.FDataGrid(data_matrix, grid_points)
>>> out_detector = MSPlotOutlierDetector()
>>> out_detector.fit_predict(fd)
array([1, 1, 1, 1])

References

Dai, Wenlin, and Genton, Marc G. “Multivariate functional data visualization and outlier detection.” Journal of Computational and Graphical Statistics 27.4 (2018): 923-934.

Methods

  • fit_predict(X[, y]) – Perform fit on X and return labels for X.

  • get_metadata_routing() – Get metadata routing of this object.

  • get_params([deep]) – Get parameters for this estimator.

  • set_params(**params) – Set the parameters of this estimator.

fit_predict(X, y=None)[source]#

Perform fit on X and return labels for X.

Returns -1 for outliers and 1 for inliers.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.

  • y (Ignored) – Not used, present for API consistency by convention.

  • **kwargs (dict) –

    Arguments to be passed to fit.

    New in version 1.4.

Returns:

y – 1 for inliers, -1 for outliers.

Return type:

ndarray of shape (n_samples,)
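For illustration, the returned labels follow the scikit-learn outlier-detector convention and can be used directly as a boolean mask; the label values below are made up:

```python
import numpy as np

labels = np.array([1, 1, -1, 1])  # e.g. the output of fit_predict
inlier_mask = labels == 1         # True where the curve is an inlier
outlier_idx = np.flatnonzero(labels == -1)
print(inlier_mask.sum(), outlier_idx)  # 3 [2]
```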

get_metadata_routing()#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
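A minimal sketch of the nested `<component>__<parameter>` form described above, using a scikit-learn `Pipeline` as the nested object (any estimator containing sub-estimators works the same way):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# <component>__<parameter>: reaches into the "clf" step of the pipeline.
pipe.set_params(clf__C=10.0)
print(pipe.get_params()["clf__C"])  # 10.0
```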