Show simple item record

dc.contributor.author: Capo M. (en_US)
dc.contributor.author: Pérez A. (en_US)
dc.contributor.author: Lozano J.A. (en_US)
dc.date.accessioned: 2020-04-02T10:26:49Z
dc.date.available: 2020-04-02T10:26:49Z
dc.date.issued: 2020
dc.identifier.uri: http://hdl.handle.net/20.500.11824/1099
dc.description.abstract: The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach due to its ease of implementation, straightforward parallelizability and relatively low computational cost. Unfortunately, the K-means algorithm also has some extensively studied drawbacks, such as its high dependence on the initial conditions and the fact that it might not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well in the number of instances of the problem, without affecting the quality of the approximation. To achieve this, instead of analyzing the entire dataset, we work on small weighted sets of representative points, distributed in such a way that more importance is given to those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to several theoretical properties that explain the reasoning behind the algorithm, experimental results indicate that our method outperforms the state-of-the-art in terms of the trade-off between the number of distance computations and the quality of the solution obtained. (en_US)
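The core idea in the abstract — running K-means on a small weighted set of representative points rather than on every instance — can be illustrated with a weighted variant of Lloyd's algorithm. The sketch below is a generic illustration of that idea, not the paper's exact recursive procedure; the function name and the toy data are assumptions made for the example.

```python
import numpy as np

def weighted_lloyd(points, weights, centers, n_iter=10):
    """Lloyd's iterations on weighted representative points.

    A minimal sketch of clustering a weighted summary of a dataset:
    each representative point stands in for `weight` original instances,
    so centroid updates use weighted means. This illustrates the general
    coreset-style idea, not the specific algorithm of the article.
    """
    for _ in range(n_iter):
        # Assignment step: nearest center for each representative point.
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Update step: each center becomes the weighted mean of its points.
        for k in range(centers.shape[0]):
            mask = labels == k
            if mask.any():
                w = weights[mask]
                centers[k] = (w[:, None] * points[mask]).sum(axis=0) / w.sum()
    return centers, labels

# Toy usage: two well-separated weighted clusters in 2-D.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
wts = np.array([3.0, 1.0, 2.0, 2.0])   # e.g. how many instances each point represents
init = np.array([[0.0, 0.0], [5.0, 5.0]])
centers, labels = weighted_lloyd(pts, wts, init.copy())
```

Because each distance computation here touches one representative instead of one original instance, the cost per iteration scales with the size of the summary rather than the full dataset, which is the trade-off the abstract emphasizes.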
dc.format: application/pdf (en_US)
dc.language.iso: eng (en_US)
dc.publisher: Data Mining and Knowledge Discovery (en_US)
dc.relation: ES/1PE/SEV-2017-0718 (en_US)
dc.relation: ES/1PE/TIN2017-82626-R (en_US)
dc.relation: EUS/ELKARTEK (en_US)
dc.rights: info:eu-repo/semantics/openAccess (en_US)
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/es/ (en_US)
dc.subject: K-means problem (en_US)
dc.subject: Lloyd's algorithm (en_US)
dc.subject: K-means++ (en_US)
dc.subject: Coresets (en_US)
dc.subject: Unsupervised learning (en_US)
dc.title: An efficient K-means clustering algorithm for tall data (en_US)
dc.type: info:eu-repo/semantics/article (en_US)
dc.type: info:eu-repo/semantics/acceptedVersion (en_US)
dc.identifier.doi: https://doi.org/10.1007/s10618-020-00678-9
dc.relation.publisherversion: https://link.springer.com/article/10.1007/s10618-020-00678-9 (en_US)

