5002 K-means Clustering Memo

K-means

First, we get a pre-defined parameter k. We make k random guesses of the mean point. Do iterations until the mean doesn’t change.

  • Assign each data point to the cluster whose mean is nearest
  • Calculate the mean of each cluster
  • Replace the cluster with a new mean

About the initialization of k-means

The popular way to initialize the starting mean is a random choice. The result depends on the initial guess, and a suboptimal result is possible, so we do several different starting points.

Disadvantages of k-means

  • Bad initial guess can lead to no points assigned to the cluster.
  • The value k is not user-friendly, cause we don’t know the number of clusters in the beginning.

Sequential k-means

To deal with data that comes at different times. We choose to update our mean point when a new data point is encountered.

Procedure

We set several initial guess points. We save variable n to represent the number of data points in the cluster(no mean point). Do iterations until interrupted:

  • Get the next data point
  • Find the closest mean point
  • Increment the number counter n
  • Replace the mean point by $m+(1/n)(x - m)$ (m for mean point and x for data point, n for counter)

How does this formula work

When data point t comes in, we get: $m_t = m_{t-1} + \frac 1 t (x - m_{t-1})$ Suppose $m_{t-1} = \sum_{1}^{t-1} x_i / (t - 1)$, We get :$m_{t-1} (t - 1) = \sum_{1}^{t-1} x_i$ $m_{t-1} (t - 1) + x_t = \sum_{1}^{t} x_i$ $m_{t-1} (t - 1)/t + x_t/t = m_t$ $m_t = m_{t-1}+ 1/t( x_t - m_{t-1})$

Forgetful Sequential K-means

Formula

$m_n = (1-a)^n m_0 + a\sum (1-a)^{n-k}x_k$ can be derived as: $m_{i+1} = m_i + a(x-m_i)$ a ranges from 1 to 0.