kmeanstf.kmeanstf.BaseKMeansTF
-
class kmeanstf.kmeanstf.BaseKMeansTF(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, tunnel=False, max_tunnel_iter=300, max_tunnel_moves_per_iter=100, criterion=1.0, local_trials=1, collect_history=False)
Base class for KMeansTF and TunnelKMeansTF.
Note
Recommended usage of this class is via the derived classes KMeansTF and TunnelKMeansTF. To use BaseKMeansTF directly, set the parameter tunnel to False (k-means/k-means++) or True (tunnel k-means).
Parameters:
- n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
- init ('random', 'k-means++' or array) – method of initialization
- n_init (int) – number of runs of the initial k-means phase with different initializations (default 10). Only one tunnel phase is performed even if n_init is larger than 1.
- max_iter (int) – Maximum number of Lloyd iterations for a single run of the k-means algorithm.
- tol (float) – Relative tolerance with regard to inertia to declare convergence.
- verbose (int) – Verbosity mode.
- random_state (int) – None, or an integer to seed the random number generators of Python, NumPy and TensorFlow
- tunnel (boolean) – whether to perform tunnel k-means
- max_tunnel_iter (int) – maximum number of tunnel iterations to perform
- max_tunnel_moves_per_iter (int) – maximum number of centroids to move in one tunnel iteration
- criterion (float) – initial required ratio error/utility (is increased adaptively)
- local_trials (int) – how many times each tunnel move should be repeated with a different random offset vector (1 or larger)
- collect_history (bool) – collect historic information on inertia, criterion, tunnel moves, codebooks
Variables:
- cluster_centers (array, [n_clusters, n_features]) – Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
- labels (array, shape(n_samples)) – Labels of each point, i.e. index of closest centroid
- inertia (float) – Sum of squared distances of samples to their closest cluster center.
- n_iter (int) – Number of iterations run.
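To make the roles of max_iter, tol, labels, inertia and n_iter concrete, here is a minimal pure-Python sketch of Lloyd iterations with random initialization. It is not the library's TensorFlow implementation: the function names lloyd, assign and sq_dist are hypothetical, and the stopping rule (relative decrease in inertia at most tol) is an assumption about how a relative tolerance on inertia is typically applied.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(X, centers):
    """Return (labels, inertia) for samples X and the given centers."""
    labels, inertia = [], 0.0
    for x in X:
        dists = [sq_dist(x, c) for c in centers]
        j = min(range(len(centers)), key=dists.__getitem__)
        labels.append(j)        # index of the closest center
        inertia += dists[j]     # squared distance to that center
    return labels, inertia

def lloyd(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Hypothetical sketch: Lloyd iterations with 'random' initialization."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, n_clusters)]
    labels, inertia = assign(X, centers)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Update step: move each center to the mean of its assigned samples.
        for j in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # leave a center without samples where it is
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        new_labels, new_inertia = assign(X, centers)
        # Assumed stopping rule: relative improvement of inertia <= tol.
        converged = inertia - new_inertia <= tol * inertia
        labels, inertia = new_labels, new_inertia
        if converged:
            break
    return centers, labels, inertia, n_iter
```

Calling lloyd(X, n_clusters=2) on a list of points returns the four quantities listed under Variables. As with any Lloyd-style method, the result is only a local optimum and depends on the initialization.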
-
__init__(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, tunnel=False, max_tunnel_iter=300, max_tunnel_moves_per_iter=100, criterion=1.0, local_trials=1, collect_history=False)
Initialize self. See help(type(self)) for accurate signature.
Methods
- __init__([n_clusters, init, n_init, …]) – Initialize self.
- fit(X) – Compute k-means clustering.
- fit_predict(X) – Compute cluster centers and predict cluster index for each sample.
- get_errs_and_utils(X[, centroids]) – Get error and utility values wrt. X.
- get_gaussian_mixture([n, d, g, sigma]) – Generate test data from Gaussian mixture distribution.
- get_history() – Get collected history data of performed run of fit().
- get_log([abbr]) – Get statistics of performed run of fit().
- get_params() – Get params used to define class.
- get_system_status([do_print]) – Print tensorflow version and availability of GPUs.
- predict(X) – Predict the closest cluster each sample in X belongs to.
- self_test([X, n_clusters, n_init, n, d, g, …]) – Self-testing routine.
- set_random_seed(seed) – Set random seed for tensorflow, python and numpy.
-
fit(X)
Compute k-means clustering.
Parameters: X (tensor) – samples
Sets:
- self.cluster_centers_
- self.inertia_
-
fit_predict(X)
Compute cluster centers and predict cluster index for each sample.
Parameters: X (tensor) – samples
Returns: array of cluster indices
-
get_errs_and_utils(X, centroids=None)
Get error and utility values wrt. X.
Error and utility are computed for the given centroids or (if centroids is None) for self.cluster_centers_.
Parameters: X (tensor) – samples
Returns: errors (array), utilities (array)
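This page does not define error and utility. The sketch below assumes the definitions commonly used in utility-based vector quantization: the error of a centroid is the summed squared distance of the samples for which it is the closest centroid, and its utility is the amount by which the total error would grow if that centroid were removed (each of its samples would fall to the second-closest centroid). The function name errs_and_utils is hypothetical and the code is plain Python, not the library's TensorFlow implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def errs_and_utils(X, centroids):
    """Per-centroid error and utility under the assumed definitions.

    Requires at least two centroids, since utility needs a second-closest one.
    """
    k = len(centroids)
    errors = [0.0] * k
    utilities = [0.0] * k
    for x in X:
        # Sort centroids by squared distance to x; keep the two closest.
        ranked = sorted((sq_dist(x, c), j) for j, c in enumerate(centroids))
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        errors[j1] += d1           # contribution to the closest centroid's error
        utilities[j1] += d2 - d1   # extra error if centroid j1 were removed
    return errors, utilities

# Three 1-D samples, two centroids.
errors, utilities = errs_and_utils([(0.0,), (1.0,), (10.0,)], [(0.5,), (10.0,)])
print(errors, utilities)  # prints [0.5, 0.0] [180.5, 90.25]
```

Under these assumptions, a centroid with low utility contributes little, while a centroid with high error sits in a poorly covered region, which is presumably what the error/utility ratio in the criterion parameter compares.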
-
static get_gaussian_mixture(n=1000, d=2, g=50, sigma=0.0005)
Generate test data from a Gaussian mixture distribution.
Returns an (n, d) tensor drawn from a mixture of g Gaussians with standard deviation sigma.
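For readers who want to see what such test data looks like without TensorFlow, the following is a rough pure-Python sketch. The function name gaussian_mixture and its seed parameter are hypothetical, and two details are assumptions rather than documented behavior: component centers are drawn uniformly from the unit hypercube, and all components have equal weight.

```python
import random

def gaussian_mixture(n=1000, d=2, g=50, sigma=0.0005, seed=None):
    """Return n points of dimension d from a mixture of g isotropic Gaussians."""
    rng = random.Random(seed)
    # Assumption: component centers are uniform in the unit hypercube [0, 1)^d.
    centers = [[rng.random() for _ in range(d)] for _ in range(g)]
    samples = []
    for _ in range(n):
        center = rng.choice(centers)  # assumption: equal mixture weights
        samples.append([rng.gauss(mean, sigma) for mean in center])
    return samples
```

The result is a list of n lists of length d, which can be converted to an array or tensor before clustering.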
-
get_history()
Get the history data collected during the last run of fit() (only present if collect_history == True).
Returns: history (dict)
-
get_log(abbr=False)
Get statistics of the last run of fit().
Parameters: abbr (bool) – return with abbreviated keys
Returns: log (dict)
-
get_params()
Get the parameters used to define the class.
Returns: params (dict)
-
static get_system_status(do_print=False)
Report the TensorFlow version and the availability of GPUs.
Parameters: do_print (bool) – also print the result
Example output (if do_print == True):
TENSORFLOW: 2.0.0 Physical GPUs: 1 Logical GPUs: 1
Returns: dict with TensorFlow version, number of physical GPUs, number of logical GPUs
-
predict(X)
Predict the closest cluster each sample in X belongs to.
Parameters: X (tensor) – samples
Returns: array of cluster indices
-
static self_test(X=None, n_clusters=100, n_init=10, n=10000, d=2, g=50, sigma: float = None, verbose=0, stats_only=0, init='k-means++', plot=True, voro=True)
Self-testing routine.
Runs both k-means++ and tunnel k-means and prints the SSE improvement of tunnel k-means over k-means++ (in the scikit-learn implementation). Uses a Gaussian mixture distribution (default) or the provided data set X. Typical output:
Data is mixture of 50 Gaussians in unit square with sigma=0.00711
algorithm      | data.shape  | k    | init      | n_init  | SSE       | Runtime  | Improvement
---------------|-------------|------|-----------|---------|-----------|----------|------------
k-means++      | (10000, 2)  | 100  | k-means++ | 10      | 0.66179   | 2.09s    | 0.00%
tunnel k-means | (10000, 2)  | 100  | random    | 1       | 0.63933   | 3.37s    | 3.39%
Parameters:
- X – data set to use (as tensorflow or numpy array). If None, use mixture of Gaussians according to the other parameters
- n_clusters (int) – the k in k-means
- n_init (int) – number of runs with different initializations
- n (int) – number of data points to generate
- d (int) – number of features (dimensionality) of generated data points
- g (int) – number of Gaussians
- sigma (float) – standard deviation of Gaussians; if None, a value is chosen based on the number of Gaussians
- init ('k-means++' or 'random') – initialization method for k-means (tunnel k-means is initialized as random)
- plot (bool) – whether to plot the result
- voro (bool) – whether to show Voronoi regions in the plot
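The Improvement column in the typical output is consistent with a relative SSE reduction with respect to the k-means++ baseline. The helper below is a hypothetical illustration of that arithmetic, not code taken from the library.

```python
def improvement_percent(sse_baseline, sse_new):
    """Relative SSE reduction of sse_new over sse_baseline, in percent."""
    return 100.0 * (sse_baseline - sse_new) / sse_baseline

# SSE values from the typical output: k-means++ vs. tunnel k-means.
print(f"{improvement_percent(0.66179, 0.63933):.2f}%")  # prints 3.39%
```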
-
static set_random_seed(seed)
Set the random seed for TensorFlow, Python and NumPy.
Parameters: seed (int) – random seed
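Seeding all three generators typically amounts to one call per library. The sketch below shows one way to do that; the name seed_everything is hypothetical, the NumPy and TensorFlow steps are skipped when the package is not installed, and tf.random.set_seed assumes TensorFlow 2.x. It is not necessarily how set_random_seed is implemented.

```python
import random

def seed_everything(seed):
    """Seed Python's RNG and, when installed, NumPy's and TensorFlow's."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow 2.x API
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed
```

Calling seed_everything with the same value before two runs makes the random draws from these libraries repeatable, although GPU kernels may still introduce small nondeterministic differences.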