🗫 ML - Clustering

Photo by Maddi Bazzocco on Unsplash

Introduction

Quoting from http://baoqiang.org/?p=579

k-Nearest-Neighbour and k-Means Clustering

These are arguably the two most commonly used clustering methods. One reason is that they are easy to use and quite straightforward. So how do they work?

k-Nearest-Neighbour: We are given N entries in n-dimensional space, each with a known class, where the number of classes is k; that is, \[ \{(\vec{x_i}, y_i)\},\quad \vec{x_i} \in \Re^{n},\quad y_i \in \{c_1, \dots, c_k\},\quad i = 1, \dots, N \]

For a new entry \(\vec{v_j}\), which class should it belong to? We use a distance measure to find the k entries closest to the new one; the final decision is a simple majority vote among these k nearest neighbours. The distance metric could be Euclidean or another similar measure.
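As a concrete illustration, here is a minimal k-NN sketch in R (the language of the workflow section below), using the built-in iris data; the choice of data, of k = 5, and of the train/test split are assumptions made purely for the example.

```r
library(class)  # provides knn()

set.seed(42)
idx   <- sample(nrow(iris), 100)   # pick 100 rows as labelled training entries
train <- iris[idx, 1:4]            # n-dimensional (here n = 4) feature vectors
test  <- iris[-idx, 1:4]           # new entries to classify
cl    <- iris$Species[idx]         # known classes y_i for the training entries

# Classify each new entry by a majority vote among its k = 5 nearest
# training neighbours, measured by Euclidean distance.
pred <- knn(train, test, cl, k = 5)
table(predicted = pred, actual = iris$Species[-idx])  # confusion matrix
```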

k-Means: Given N entries in n-dimensional space, classify them into k classes. At first, we randomly choose k entries and assign one to each cluster; these are the seed clusters. Then we calculate the distance between each entry and each cluster, and assign each entry to its closest cluster. Once the assignment is complete, we recalculate the centroid of each cluster from its new members. We then return to the distance calculation for a new round of assignment, and stop iterating when the process converges, i.e., when the centroids and assignments no longer change.
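A matching k-means sketch in base R, again on the iris data; taking k = 3 is an assumption for the example (it matches the three species), and nstart = 25 simply reruns the random seeding several times and keeps the best result.

```r
set.seed(42)
X <- iris[, 1:4]                 # N entries, each an n = 4 dimensional vector

# kmeans() seeds k random centroids, assigns each entry to its closest
# centroid, recomputes each centroid from its new members, and iterates
# until the assignments converge.
km <- kmeans(X, centers = 3, nstart = 25)

km$centers                                            # final centroid of each cluster
table(cluster = km$cluster, species = iris$Species)   # clusters vs. true labels
```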

Strictly speaking, k-NN is a supervised method, since it needs labelled training data, while k-means is unsupervised. What the two share is that we must supply the number k before running them: the number of neighbours to consult for k-NN, and the number of clusters for k-means.

Workflow using Orange

Workflow using Radiant

Workflow using R

Conclusion

Arvind V.

My research interests are Complexity Science, Creativity and Innovation, Problem Solving with TRIZ, Literature, Indian Classical Music, and Computing with R.
