K-means for Evolving Data Streams
Abstract
Nowadays, streaming data analysis has become a relevant area of research in machine learning. Most of the data streams available are unlabeled, and thus it is necessary to develop specific clustering techniques that take into account the particularities of the streaming data. In streaming data scenarios, the data is composed of an increasing sequence of batches of samples where the concept drift phenomenon may occur. In this work, we formally define the streaming K -means (SKM) problem, which implies a restart of the error function when a concept drift occurs. An approximated error function that does not rely on concept drift detection is proposed. We prove that such a surrogate is a good approximation of the SKM error. Then, we introduce an algorithm to deal with SKM problem by minimizing the surrogate error function each time a new batch arrives. Alternative initialization criteria are presented and theoretically analyzed for streaming data scenarios. Among them, we develop and analyze theoretically two initialization methods that search for the best trade-off between the importance that is given to the past and the current batches. The experiments show that the proposed algorithm with, the proposed initialization criteria, obtain the best results when dealing with the SKM problem without requiring to detect when concept drift takes place.