5002 K-means Clustering Memo
K-means
First, we get a pre-defined parameter k. We make k random guesses of the mean point. Do iterations until the mean doesn’t change.
- Assign each data point to the cluster whose mean is nearest
- Calculate the mean of each cluster
- Replace the cluster with a new mean
About the initialization of k-means
The popular way to initialize the starting mean is a random choice. The result depends on the initial guess, and a suboptimal result is possible, so we do several different starting points.
Disadvantages of k-means
- Bad initial guess can lead to no points assigned to the cluster.
- The value k is not user-friendly, cause we don’t know the number of clusters in the beginning.
Sequential k-means
To deal with data that comes at different times. We choose to update our mean point when a new data point is encountered.
Procedure
We set several initial guess points. We save variable n to represent the number of data points in the cluster(no mean point). Do iterations until interrupted:
- Get the next data point
- Find the closest mean point
- Increment the number counter n
- Replace the mean point by $m+(1/n)(x - m)$ (m for mean point and x for data point, n for counter)
How does this formula work
When data point t comes in, we get: $m_t = m_{t-1} + \frac 1 t (x - m_{t-1})$ Suppose $m_{t-1} = \sum_{1}^{t-1} x_i / (t - 1)$, We get :$m_{t-1} (t - 1) = \sum_{1}^{t-1} x_i$ $m_{t-1} (t - 1) + x_t = \sum_{1}^{t} x_i$ $m_{t-1} (t - 1)/t + x_t/t = m_t$ $m_t = m_{t-1}+ 1/t( x_t - m_{t-1})$
Forgetful Sequential K-means
Formula
$m_n = (1-a)^n m_0 + a\sum (1-a)^{n-k}x_k$ can be derived as: $m_{i+1} = m_i + a(x-m_i)$ a ranges from 1 to 0.