An efficient K-means clustering algorithm for tall data
The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach owing to its ease of implementation, straightforward parallelizability and relatively low computational cost. Unfortunately, the K-means algorithm also has some drawbacks that have been extensively studied, such as its high dependency on the initial conditions and the fact that it might not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well with the number of instances of the problem, without affecting the quality of the approximation. To achieve this, instead of analyzing the entire dataset, we work on small weighted sets of representative points that are distributed in such a way that more importance is given to those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to several theoretical properties, which explain the reasoning behind the algorithm, experimental results indicate that our method outperforms the state of the art in terms of the trade-off between the number of distance computations and the quality of the solution obtained.
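The idea of clustering a small weighted set of representatives instead of the full dataset can be illustrated with a minimal sketch. It is not the algorithm proposed in the article: the recursive, boundary-focused partitioning is not reproduced here. A fixed uniform grid stands in for it as a simplifying assumption, and the names `summarize`, `weighted_lloyd` and `cell_size` are hypothetical. Each grid cell contributes one representative (its center of mass) weighted by the number of points it contains, and a weighted Lloyd iteration then runs on the representatives only.

```python
def summarize(points, cell_size):
    """Replace points by one weighted representative per grid cell.

    Assumption: a fixed uniform grid, used only for illustration.
    """
    cells = {}
    for p in points:
        key = tuple(int(x // cell_size) for x in p)
        total, count = cells.get(key, ([0.0] * len(p), 0))
        cells[key] = ([t + x for t, x in zip(total, p)], count + 1)
    # Representative = center of mass of the cell; weight = cell cardinality.
    reps = [[t / n for t in total] for total, n in cells.values()]
    weights = [n for _, n in cells.values()]
    return reps, weights


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_lloyd(reps, weights, centers, iters=10):
    """Lloyd iterations in which each representative counts with its weight."""
    for _ in range(iters):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        mass = [0.0] * len(centers)
        for r, w in zip(reps, weights):
            # Assign the representative to its nearest current center.
            j = min(range(len(centers)), key=lambda k: sq_dist(r, centers[k]))
            mass[j] += w
            sums[j] = [s + w * x for s, x in zip(sums[j], r)]
        # Weighted mean per cluster; an empty cluster keeps its old center.
        centers = [
            [s / m for s in sums_j] if m > 0 else c
            for sums_j, m, c in zip(sums, mass, centers)
        ]
    return centers
```

In this sketch the number of distance computations per iteration is proportional to the number of representatives rather than to the number of original instances, which is the source of the savings when many points share a cell.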