K-means clustering algorithm

2019-11-06 06:32:33

Source: reposted

Contributed by: a reader

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, in that both algorithms use an iterative refinement approach. Both also use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
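To make "nearest mean" concrete: k-means is usually stated as minimizing the within-cluster sum of squared Euclidean distances to each cluster's mean. The following is a minimal sketch of that quantity, not taken from the original post; the helper names `centroid` and `within_cluster_ss` and the sample points are made up for illustration.

```python
def centroid(points):
    """Component-wise arithmetic mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(clusters):
    """Sum, over all clusters, of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total


# Two ways to split the same four (hypothetical) points into k = 2 clusters.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(within_cluster_ss(good))  # 4.0
print(within_cluster_ss(bad))   # 100.0
```

The partition that groups nearby points scores far lower, which is the sense in which it is the better clustering under this criterion.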

The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular machine learning technique for classification that is often confused with k-means because of the k in the name. One can apply the 1-nearest neighbor classifier to the cluster centers obtained by k-means to classify new data into the existing clusters. This is known as the nearest centroid classifier or the Rocchio algorithm.
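As a rough illustration of classifying new data with 1-nearest neighbor on the cluster centers, here is a small sketch. It assumes Euclidean distance and Python 3.8+ (for `math.dist`); the function name `nearest_centroid` and the center coordinates are hypothetical stand-ins for the output of an earlier k-means run.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


# Hypothetical centers, standing in for the result of a previous k-means run.
centroids = [(0.0, 0.0), (10.0, 10.0)]

new_points = [(1.0, 2.0), (9.0, 8.0)]
labels = [nearest_centroid(p, centroids) for p in new_points]
print(labels)  # [0, 1]
```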

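The main use of this algorithm: given many points on a screen, group neighboring points around the center nearest to them. The k-means algorithm is a clustering algorithm that divides n objects into k partitions according to their attributes, with k < n. It is very similar to the expectation-maximization algorithm for mixtures of normal distributions, because both try to find the centers of the natural clusters in the data.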

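Clustering is, at its core, finding things that are closely related and separating them into groups. When there are only a few items, a person can do this by hand, but with large-scale data manual effort is hopeless, so we rely on computers to find the relationships within massive data sets. One quantity is central to the clustering process: a value that expresses how closely two items are related, called the distance metric. The two earlier posts, on similarity measures and on the Pearson similarity coefficient, both describe ways to compute such a metric. Choose the metric that suits the situation at hand; this choice matters a great deal, because it directly determines whether the clustering result matches actual needs and conditions.

K-means clustering is a classic clustering algorithm whose time and space costs are both relatively low: each iteration takes time proportional to the number of points times K times the dimension, and memory is needed only for the points and the K centers. The name states the core intent: it groups the data into K classes. The procedure is:

1. Randomly pick K points from the data set as the initial center of each class (you may also pick the K points by some other method).
2. Using the distance metric, assign every point in the data set to the nearest of the K centers. The points assigned to one center form a class, giving K classes.
3. Compute the mean of each of the K classes to obtain new centers, and replace each class's previous center with the new one. Different mean models can be used as needed; in general the center is chosen to minimize the total distance from the points of the cluster to it, and the arithmetic mean minimizes the sum of squared Euclidean distances.
4. Check whether the new set of K centers is the same as the previous set. If not, go back to step 2; if so, clustering is complete.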
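The four steps above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the original author's code: it assumes Euclidean distance, points given as tuples, random initialization, and Python 3.8+ (for `math.dist`). The function name `kmeans` and the parameters `max_iter` and `seed` are made up for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points at random as the initial centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Step 3: recompute each center as the mean of its members.
        # An empty cluster keeps its old center (an assumption of this sketch).
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 4: stop once the centers no longer change; otherwise repeat.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical sample data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(data, k=2)
print(sorted(centers))
```

Because the result depends on the initial centers, a different `seed` can lead to a different local optimum on less cleanly separated data; running the algorithm several times and keeping the best result is a common workaround.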

http://lib.csdn.net/article/machinelearning/35217

http://blog.pureisle.net/archives/1982.html

http://blog.csdn.net/garfielder007/article/details/51476104

http://blog.csdn.net/abcjennifer/article/details/8170687