5002 Other Clustering Memo
Methods
- Model-Based Clustering (EM algorithm)
- Density-Based Clustering (DBSCAN)
- Scalable Clustering Method (BIRCH)
EM algorithm
why?
- previous method, each point belongs to a single cluster. No point belongs to different cluster with probabilities.
Normal distribution
$p(x|<\mu, \sigma>)=\frac{1}{\sqrt{2\pi}\sigma}e^{- \frac{(x-\mu)^2}{2\sigma^2}}$ sigma = standard derivation, mu = mean.
Procedure
- Initialize all $\mu_i$ and $\sigma_i$ (random)
- For each point x, we calculate its probability belong to cluster i. (formula: $p(x\in C_i) = \frac{p(x|<\mu_i, \sigma_i>)} { \sum p(x|<\mu, \sigma>})$, i.e. probability of cluster i divided by the sum of probability of all the clusters)
- We calculate the new mean of cluster, and update $\mu$. using formula($\mu_i = \sum_{X} x \frac{p(x\in C_i)}{\sum_y p(y\in C_i)}$. (i.e. according to the probabilities that all points belong to cluster i) Repeat until converge
DBSCAN
why?
traditional clustering cannot handle irregular shaped cluster
Concepts
$\epsilon$-neighborhood
denote by N(p), is the set of points that distance with p is within $\epsilon$ (including itself).
Kinds of point
- core points, N(p) is at least MinPts(a parameter).
- border points, Not core point, but N(p) contains at least one core point
- noise points, not core or border
procedure
- Each cluster contains at least one core point (create cluster from core point)
- If core point p and q, and N(p) contain q (or inverse), merge the cluster into one. (they are in the same cluster)
- Border point is assigned to previous created cluster if it contains the core point (arbitrarily if many).
- All noise points do not belong to any.
BIRCH
why?
most previous algorithm cannot handle update, and not scalable.
Concepts
- Mean $\mu = \frac{\sum x}{n}$
- Radius $R = \sqrt{\frac{\sum (x-\mu)^2}{n}}$, average distance from member to the mean.
- Diameter $D=\sqrt{\frac{\sum\sum(x_i-x_j)^2}{n(n-1)}}$, average pair-wise distance within a cluster.
Procedure
L = {} when there is a new data point x comes in:
- If L is empty: create cluster C containing x, and insert into L.
- If not empty: Find closest cluster C of x, insert x into C. When C has diameter D greater than a given threshold, splite C into 2 sub-clusters, and insert them into L, remove the old one. (split: using dendrogram or other)
Acceleration
- n: no. of points in the cluster
- LS: the linear sum of n points $\sum x_i$
- SS: the square sum of the n points $\sum x^2$ Update them is feasible and fast. Comparison of time consumption of each update: O(n^2) and O(1).