kmeanstf.kmeanstf.BaseKMeansTF

class kmeanstf.kmeanstf.BaseKMeansTF(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, tunnel=False, max_tunnel_iter=300, max_tunnel_moves_per_iter=100, criterion=1.0, local_trials=1, collect_history=False)

Base class for KMeansTF and TunnelKMeansTF

Note

Recommended usage of this class is via the derived classes KMeansTF and TunnelKMeansTF. To use BaseKMeansTF directly, set the parameter tunnel to False (k-means/k-means++) or True (tunnel k-means), as sketched in the example after the Variables list below.

Parameters:
  • n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
  • init ('random', 'k-means++' or array) – method of initialization
  • n_init (int) – number of runs of the initial k-means phase with different initializations (default 10). Only one tunnel phase is performed even if n_init is larger than 1.
  • max_iter (int) – Maximum number of Lloyd iterations for a single run of the k-means algorithm.
  • tol (float) – Relative tolerance with regards to inertia to declare convergence.
  • verbose (int) – Verbosity mode.
  • random_state (int) – None, or integer to seed the random number generators of python, numpy and tensorflow
  • tunnel (boolean) – perform tunnel k-means?
  • max_tunnel_iter (int) – maximum number of tunnel iterations to perform
  • max_tunnel_moves_per_iter (int) – maximum number of centroids to move in one tunnel iteration
  • criterion (float) – initial required error/utility ratio (increased adaptively)
  • local_trials (int) – how many times each tunnel move is repeated with a different random offset vector (1 or larger)
  • collect_history (bool) – collect history data on inertia, criterion, tunnel moves and codebooks
Variables:
  • cluster_centers_ (array, [n_clusters, n_features]) – Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
  • labels_ (array, shape(n_samples)) – Labels of each point, i.e. index of the closest centroid
  • inertia_ (float) – Sum of squared distances of samples to their closest cluster center.
  • n_iter_ (int) – Number of iterations run.
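
Usage example (an illustrative sketch, not part of the API reference: the random data, the chosen parameter values and the import via the module path shown above are assumptions):

import numpy as np
from kmeanstf.kmeanstf import BaseKMeansTF

X = np.random.rand(1000, 2).astype(np.float32)  # synthetic sample data

# plain k-means / k-means++ behavior
km = BaseKMeansTF(n_clusters=8, init='k-means++', tunnel=False, random_state=42)
km.fit(X)
print(km.inertia_)                # sum of squared distances
print(km.cluster_centers_.shape)  # (8, 2)

# tunnel k-means behavior
tkm = BaseKMeansTF(n_clusters=8, tunnel=True, random_state=42)
tkm.fit(X)
print(tkm.inertia_)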
__init__(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, tunnel=False, max_tunnel_iter=300, max_tunnel_moves_per_iter=100, criterion=1.0, local_trials=1, collect_history=False)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__([n_clusters, init, n_init, …]) Initialize self.
fit(X) Compute k-means clustering.
fit_predict(X) Compute cluster centers and predict cluster index for each sample.
get_errs_and_utils(X[, centroids]) Get error and utility values wrt. X.
get_gaussian_mixture([n, d, g, sigma]) Generate test data from a Gaussian mixture distribution.
get_history() Get collected history data of the performed run of fit().
get_log([abbr]) Get statistics of the performed run of fit().
get_params() Get the parameters used to define the class.
get_system_status([do_print]) Get TensorFlow version and availability of GPUs.
predict(X) Predict the closest cluster each sample in X belongs to.
self_test([X, n_clusters, n_init, n, d, g, …]) Self-testing routine.
set_random_seed(seed) Set the random seed for TensorFlow, Python and NumPy.
fit(X)

Compute k-means clustering.

Parameters: X (tensor) – samples

Sets:

  • self.cluster_centers_
  • self.inertia_
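
Example (a minimal sketch; the synthetic data and parameter values are illustrative, and the import path follows the module name shown above):

import numpy as np
from kmeanstf.kmeanstf import BaseKMeansTF

X = np.random.rand(500, 3).astype(np.float32)
km = BaseKMeansTF(n_clusters=10, max_iter=300, tol=1e-4)
km.fit(X)
print(km.cluster_centers_)  # set by fit()
print(km.inertia_)          # set by fit()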
fit_predict(X)

Compute cluster centers and predict cluster index for each sample.

Parameters: X (tensor) – samples
Returns: array of cluster indices
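
Example (illustrative; the synthetic data is an assumption):

import numpy as np
from kmeanstf.kmeanstf import BaseKMeansTF

X = np.random.rand(500, 2).astype(np.float32)
km = BaseKMeansTF(n_clusters=5)
labels = km.fit_predict(X)  # one cluster index per sample
print(labels[:10])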
get_errs_and_utils(X, centroids=None)

Get error and utility values wrt. X

Parameters: X (tensor) – samples

Error and utility are computed for the given centroids or (if centroids is None) for self.cluster_centers_.

Returns: errors (array), utilities (array)
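
Example (a sketch; the exact shape and interpretation of the returned arrays are not assumed here, they are simply printed):

import numpy as np
from kmeanstf.kmeanstf import BaseKMeansTF

X = np.random.rand(1000, 2).astype(np.float32)
km = BaseKMeansTF(n_clusters=8)
km.fit(X)

errs, utils = km.get_errs_and_utils(X)  # uses self.cluster_centers_ since centroids=None
print(errs)
print(utils)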
static get_gaussian_mixture(n=1000, d=2, g=50, sigma=0.0005)

Generate test data from a Gaussian mixture distribution.

Returns an (n, d) tensor drawn from a mixture of g Gaussians with standard deviation sigma.
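
Example (illustrative; the parameter values are arbitrary):

from kmeanstf.kmeanstf import BaseKMeansTF

# 10000 2-D points drawn from a mixture of 50 Gaussians
X = BaseKMeansTF.get_gaussian_mixture(n=10000, d=2, g=50, sigma=0.0005)
print(X.shape)  # (10000, 2)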

get_history()

Get collected history data of performed run of fit().

(only present if collect_history == True)

Returns: history (dict)
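
Example (a sketch; the concrete keys of the returned dict are not assumed, only that collect_history=True must have been set for the run):

from kmeanstf.kmeanstf import BaseKMeansTF

X = BaseKMeansTF.get_gaussian_mixture(n=2000, d=2, g=10)
km = BaseKMeansTF(n_clusters=20, tunnel=True, collect_history=True)
km.fit(X)
history = km.get_history()  # dict with historic data of the run
print(history.keys())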
get_log(abbr=False)

Get statistics of the performed run of fit().

Parameters: abbr (bool) – return with abbreviated keys
Returns: log (dict)
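
Example (a sketch; the concrete log keys are not assumed):

from kmeanstf.kmeanstf import BaseKMeansTF

X = BaseKMeansTF.get_gaussian_mixture(n=2000, d=2, g=10)
km = BaseKMeansTF(n_clusters=20)
km.fit(X)
print(km.get_log())           # full keys
print(km.get_log(abbr=True))  # abbreviated keys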
get_params()

Get the parameters used to define the class.

Returns: params (dict)
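
Example (illustrative):

from kmeanstf.kmeanstf import BaseKMeansTF

km = BaseKMeansTF(n_clusters=8, tunnel=True)
print(km.get_params())  # dict of the constructor parameters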
static get_system_status(do_print=False)

Get TensorFlow version and availability of GPUs (optionally printed).

Parameters: do_print (bool) – also print the result

Example output (if do_print==True):

TENSORFLOW: 2.0.0
Physical GPUs: 1   Logical GPUs: 1
Returns: dict with TensorFlow version, number of physical GPUs and number of logical GPUs
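
Example (a sketch; the exact keys of the returned dict are not assumed):

from kmeanstf.kmeanstf import BaseKMeansTF

status = BaseKMeansTF.get_system_status(do_print=True)
print(status)  # TensorFlow version and GPU counts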
predict(X)

Predict the closest cluster each sample in X belongs to.

Parameters: X (tensor) – samples
Returns: array of cluster indices
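
Example (illustrative; the synthetic training and query data are assumptions):

import numpy as np
from kmeanstf.kmeanstf import BaseKMeansTF

X_train = np.random.rand(1000, 2).astype(np.float32)
X_new = np.random.rand(10, 2).astype(np.float32)

km = BaseKMeansTF(n_clusters=8)
km.fit(X_train)
print(km.predict(X_new))  # index of the closest centroid for each new sample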
static self_test(X=None, n_clusters=100, n_init=10, n=10000, d=2, g=50, sigma: float = None, verbose=0, stats_only=0, init='k-means++', plot=True, voro=True)

Self-testing routine.

Runs both k-means++ and tunnel k-means and prints the SSE improvement of tunnel k-means over k-means++ (as implemented in scikit-learn). Uses a Gaussian mixture distribution (default) or the provided data set X. Typical output:

Data is mixture of 50 Gaussians in unit square with sigma=0.00711
algorithm      | data.shape  |   k  | init      | n_init  |     SSE   | Runtime  | Improvement
---------------|-------------|------|-----------|---------|-----------|----------|------------
k-means++      | (10000, 2)  |  100 | k-means++ |      10 |   0.66179 |    2.09s | 0.00%
tunnel k-means | (10000, 2)  |  100 | random    |       1 |   0.63933 |    3.37s | 3.39%
Parameters:
  • X – data set to use (as a TensorFlow or NumPy array). If None, a mixture of Gaussians is generated according to the other parameters
  • n_clusters (int) – the k in k-means
  • n_init (int) – number of runs with different initializations
  • n (int) – number of data points to generate
  • d (int) – number of features (dimensionality) of generated data points
  • g (int) – number of Gaussians
  • sigma (float) – standard deviation of the Gaussians; if None, a value is chosen based on the number of Gaussians
  • init ('k-means++' or 'random') – initialization method for k-means (tunnel k-means is always initialized randomly)
  • plot (bool) – plot the result?
  • voro (bool) – show Voronoi regions in plot?
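
Example (a sketch; with X=None a Gaussian mixture is generated from the other parameters, and plotting is disabled here):

from kmeanstf.kmeanstf import BaseKMeansTF

# compare k-means++ and tunnel k-means on generated data, without plotting
BaseKMeansTF.self_test(n_clusters=100, n=10000, d=2, g=50, plot=False)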
static set_random_seed(seed)

Set the random seed for TensorFlow, Python and NumPy.

Parameters: seed (int) – random seed
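
Example (illustrative):

from kmeanstf.kmeanstf import BaseKMeansTF

BaseKMeansTF.set_random_seed(42)  # seeds Python, NumPy and TensorFlow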