## K-means for massive data

##### Abstract

The $K$-means algorithm is undoubtedly one of the most popular cluster analysis techniques, due to its ease of implementation, straightforward parallelizability
and competitive computational complexity when compared to more sophisticated clustering alternatives. However, the
progressive growth of the amount of data that needs to be analyzed, in a wide variety of scientific fields, represents
a significant challenge for the $K$-means algorithm, since its time complexity is dominated by the number of distance computations,
which is linear with respect to both the
number of instances, $n$, and the dimensionality of the problem, $d$. This fact hinders its scalability to such massive data sets.
Another major drawback of the $K$-means algorithm is its high
dependence on the initial conditions, which
not only may affect the quality of the obtained solution, but may also have a major impact on its computational load; for instance, a poor
initialization can lead to an exponential running time in
the worst-case scenario.
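For reference, each Lloyd's iteration of $K$-means alternates an assignment step and an update step, which is where the $O(nKd)$ distance computations arise. A minimal NumPy sketch of this baseline (an illustrative version, not the dissertation's implementation):

```python
import numpy as np

def lloyd(X, centroids, max_iter=100):
    """Plain Lloyd's algorithm: O(n*K*d) distance computations per iteration."""
    for _ in range(max_iter):
        # Assignment step: closest centroid for every instance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster.
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centroids[k] for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break  # fixed point reached
        centroids = new
    return centroids, labels
```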
In this dissertation, we tackle all these difficulties. Initially, we propose
an approximation to the $K$-means problem, the Recursive Partition-based $K$-means algorithm (RP$K$M). This approach consists of recursively applying
a weighted version of the $K$-means algorithm over a sequence of spatial partitions of the data set, in which each
cell of the partition is represented by the center of mass of the points that lie in it. From one
iteration to the next, a more refined partition is constructed and
the process is repeated using the optimal set of centroids, obtained
at the previous iteration, as initialization. From a practical standpoint, this process reduces the computational load of the $K$-means algorithm, since the number of representatives at each iteration
is commonly much smaller than the number of instances in the data set. Moreover, both phases of the algorithm are embarrassingly parallel.
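The weighted Lloyd's update that operates over cell representatives can be sketched as follows (an illustrative NumPy version, assuming the representatives and their weights have already been extracted from the partition; not the dissertation's exact implementation):

```python
import numpy as np

def weighted_lloyd(reps, weights, centroids, max_iter=100):
    """Weighted K-means over cell representatives: the per-iteration cost
    now scales with the number of cells, not the number of instances n."""
    for _ in range(max_iter):
        d2 = ((reps[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for k in range(len(centroids)):
            mask = labels == k
            if mask.any():
                w = weights[mask]
                # Weighted mean: each representative counts as w points.
                new[k] = (w[:, None] * reps[mask]).sum(axis=0) / w.sum()
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```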
From a theoretical standpoint, and regardless of the selected partition strategy, one can guarantee the
non-repetition of the clusterings generated at each RP$K$M iteration, which ultimately reduces
the total number of $K$-means iterations and,
in most cases, leads to a monotone decrease of the overall error
function. Afterwards, we present an RP$K$M-type approach, the Boundary Weighted $K$-means algorithm (BW$K$M). For this technique,
the data set partition is based on an
adaptive mesh that adjusts the size of each grid cell to maximize the chances of each cell containing only
instances of the same cluster.
The goal is to focus most of the computational resources
on those regions where it is harder to determine the correct cluster assignment
of the original instances (which is the main source of error for our approximation). For such a construction, it can be proved that
if all the cells of a
spatial partition are well assigned (contain only instances of the same cluster) at the end of a BW$K$M
step, then the obtained clustering is actually a fixed point
of the $K$-means algorithm over the entire data set. Furthermore, if, for
a certain step of BW$K$M, this property can be verified at
consecutive weighted Lloyd’s iterations, then the error of
our approximation also decreases monotonically. From a practical standpoint, BW$K$M was compared to the state of the art:
$K$-means++, Forgy $K$-means, Markov Chain Monte Carlo $K$-means and Minibatch $K$-means. The obtained results show that
BW$K$M commonly converged to solutions
with a relative error of under $1\%$ with respect to the considered methods, while using a much smaller number of
distance computations (up to $7$ orders of magnitude fewer).
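The notion of a well-assigned cell admits a cheap sufficient test: if, at the cell's midpoint, the gap between the distances to the two nearest centroids exceeds the cell's diagonal, the triangle inequality guarantees that every point in the cell shares the same closest centroid. A sketch of such a criterion (illustrative, not necessarily the exact test used by BW$K$M):

```python
import numpy as np

def cell_is_well_assigned(cell_min, cell_max, centroids):
    """Sufficient test that every point of the box [cell_min, cell_max]
    has the same closest centroid (illustrative criterion)."""
    rep = (cell_min + cell_max) / 2.0          # cell midpoint
    delta = np.linalg.norm(cell_max - rep)     # half the cell diagonal
    d = np.linalg.norm(centroids - rep, axis=1)
    d.sort()
    # Triangle inequality: for any x in the cell, d(x, c) lies within
    # delta of d(rep, c), so a lead larger than the full diagonal
    # keeps the nearest centroid nearest everywhere in the cell.
    return d[1] - d[0] > 2.0 * delta
```

Cells failing this test are exactly the boundary cells on which BW$K$M concentrates its refinement effort.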
Even though the computational cost of BW$K$M is linear with respect to the dimensionality, its error quality guarantees are mainly related to the
diagonal length of the grid cells, meaning that, as the dimensionality of the problem increases, it becomes harder for BW$K$M to
maintain such a competitive performance. Taking this into consideration, we developed a fully parallelizable
feature selection technique intended for the $K$-means algorithm, the Univariate $K$-means relevance for feature selection algorithm ($K$MR). This approach
consists of applying any heuristic for the $K$-means problem over multiple subsets of dimensions (each
of which is bounded by a predefined constant, $m\ll d$) and using the obtained clusterings to upper-bound the increase in the $K$-means error when deleting a given feature. We then select
the features with the $m$ largest error increases. Not only can each step of $K$MR be easily parallelized, but its computational cost is dominated by that of the selected
heuristic (on $m$ dimensions), which makes it a suitable dimensionality reduction alternative for BW$K$M on large data sets.
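The scoring idea behind $K$MR can be illustrated as follows: given a clustering obtained on a subset of dimensions, each feature is scored by an estimate of how much the $K$-means error would grow if that feature were deleted. The helper names and the exact score below (per-feature between-cluster spread) are simplified for illustration and are not the dissertation's precise bound:

```python
import numpy as np

def kmr_scores(X, labels):
    """Score each feature by an estimate of the K-means error increase
    incurred by deleting it: total per-feature SSE minus the
    within-cluster per-feature SSE under the given clustering."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return total - within   # larger => deleting the feature costs more

def select_features(X, labels, m):
    """Keep the m features with the largest estimated error increase."""
    return np.argsort(kmr_scores(X, labels))[::-1][:m]
```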
Besides providing a theoretical bound for the solution obtained via $K$MR with respect to the optimal $K$-means clustering, we
analyze its performance in comparison to well-known feature selection and feature extraction techniques. Such an analysis shows
$K$MR to consistently obtain results with lower $K$-means error than all the considered feature selection
techniques: Laplacian scores, maximum variance and random
selection, while requiring similar or lower computational
times than these approaches. Moreover, when compared to feature extraction
techniques, such as Random Projections, $K$MR also shows a noticeable improvement in both
error and computational time.
Finally, in response to the high dependence of the $K$-means algorithm on its initialization, we introduce a
cheap yet effective Split-Merge step that can be used to restart the $K$-means
algorithm after it reaches a fixed point, yielding the Split-Merge $K$-means algorithm (SM$K$M). Under some settings, one
can show that this approach reduces the error of the given fixed
point without requiring any further iteration of the $K$-means
algorithm. Moreover, experimental results show that this strategy
is able to generate approximations with an associated error that
is hard to reach for different multi-start methods, such as multi-start Forgy $K$-means, $K$-means++ and Hartigan $K$-means. In particular, SM$K$M
consistently generated the local minima with the lowest $K$-means error, reducing the relative error, on average, by over $1$ and $2$ orders of
magnitude with respect to $K$-means++ and Hartigan
$K$-means, and to Forgy $K$-means, respectively. We have also observed that SM$K$M improves upon the aforementioned methods in terms
of the number of distances computed and the error of the obtained solutions.
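A simplified version of such a Split-Merge step, splitting the cluster with the largest within-cluster error along its widest axis and merging the two closest remaining centroids so that $K$ stays constant, could look like the sketch below (an illustrative heuristic, not necessarily the exact SM$K$M criteria):

```python
import numpy as np

def split_merge_step(X, labels, centroids):
    """Restart heuristic for a K-means fixed point: split the worst
    cluster, merge the two closest other centroids, keep K constant."""
    K = len(centroids)
    # Per-cluster within-cluster sum of squared errors.
    sse = np.array([((X[labels == k] - centroids[k]) ** 2).sum()
                    for k in range(K)])
    s = int(sse.argmax())                     # cluster to split
    # Merge the two mutually closest centroids (excluding the split one).
    d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    d[np.arange(K), np.arange(K)] = np.inf
    d[s, :] = d[:, s] = np.inf
    i, j = np.unravel_index(d.argmin(), d.shape)
    merged = centroids[[i, j]].mean(axis=0)
    # Split cluster s at the median of its dimension of largest spread.
    Xs = X[labels == s]
    axis = Xs.var(axis=0).argmax()
    med = np.median(Xs[:, axis])
    lo = Xs[Xs[:, axis] <= med].mean(axis=0)
    hi = Xs[Xs[:, axis] > med].mean(axis=0)
    keep = [centroids[k] for k in range(K) if k not in (i, j, s)]
    return np.array(keep + [merged, lo, hi])
```

The resulting set of centroids is then used to restart Lloyd's iterations from the new, typically lower-error, configuration.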