dc.contributor.author Capo, M.
dc.contributor.author Pérez, A.
dc.contributor.author Lozano, J.A.
dc.date.accessioned 2016-06-28T11:57:05Z
dc.date.available 2016-06-28T11:57:05Z
dc.date.issued 2016-06-28
dc.identifier.issn 0950-7051
dc.identifier.uri http://hdl.handle.net/20.500.11824/289
dc.description.abstract Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. In spite of its dependency on the initial settings and the large number of distance computations that it can require to converge, the $K$-means algorithm remains one of the most popular clustering methods for massive datasets. In this work, we propose an efficient approximation to the $K$-means problem intended for massive data. Our approach recursively partitions the entire dataset into a small number of subsets, each of which is characterized by its representative (center of mass) and weight (cardinality). A weighted version of the $K$-means algorithm is then applied over this local representation, which can drastically reduce the number of distances computed. In addition to some theoretical properties, experimental results indicate that our method outperforms well-known approaches, such as $K$-means++ and minibatch $K$-means, in terms of the relation between the number of distance computations and the quality of the approximation. en_US
dc.description.sponsorship Marco Capó and Aritz Pérez are partially supported by the Basque Government, Elkartek and by the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa excellence accreditation SVP-2014-068574 and SEV-2013-0323. José A. Lozano is partially supported by the Basque Government (IT609-13), Elkartek and the Spanish Ministry of Economy and Competitiveness MINECO (TIN2013-41272P). en_US
dc.format application/pdf en_US
dc.language.iso eng en_US
dc.rights Reconocimiento-NoComercial-CompartirIgual 3.0 España en_US
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/es/ en_US
dc.subject K-means en_US
dc.subject clustering en_US
dc.subject K-means++ en_US
dc.subject minibatch K-means en_US
dc.title An efficient approximation to the K-means clustering for Massive Data en_US
dc.type info:eu-repo/semantics/article en_US
dc.identifier.doi 10.1016/j.knosys.2016.06.031
dc.relation.projectID ES/1PE/SEV-2013-0323 en_US
dc.relation.projectID BERC en_US
dc.rights.accessRights info:eu-repo/semantics/openAccess en_US
dc.type.hasVersion info:eu-repo/semantics/acceptedVersion en_US
dc.journal.title Knowledge-Based Systems en_US
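
The idea summarized in the abstract (replace each subset by its center of mass and cardinality, then run a weighted $K$-means over those representatives) can be illustrated with a minimal sketch. This is not the authors' implementation: the single-level uniform grid below is an assumed stand-in for the recursive partition described in the paper, and the function names `grid_representatives` and `weighted_kmeans` are hypothetical.

```python
import numpy as np

def grid_representatives(X, bins):
    """Partition the bounding box of X with a uniform grid (an assumed,
    simplified partition) and return, for every non-empty cell, its
    center of mass and its cardinality."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)          # avoid division by zero
    cells = np.minimum((bins * (X - lo) / span).astype(int), bins - 1)
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)                   # cell index of each point
    sums = np.zeros((counts.size, X.shape[1]))
    np.add.at(sums, inverse, X)                     # per-cell coordinate sums
    return sums / counts[:, None], counts.astype(float)

def assign(points, centers):
    """Index of the nearest center (squared Euclidean distance)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def weighted_kmeans(reps, weights, k, iters=50, seed=0):
    """Lloyd-style iterations in which every representative counts
    as `weights[i]` points. Requires k <= len(reps)."""
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(reps, centers)
        new_centers = centers.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():                          # keep empty clusters fixed
                new_centers[j] = np.average(
                    reps[mask], axis=0, weights=weights[mask])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(reps, centers)

# Synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (500, 2)),
               rng.normal(5.0, 0.5, (500, 2))])
reps, w = grid_representatives(X, bins=8)
centers, labels = weighted_kmeans(reps, w, k=2)
```

In this sketch each iteration computes distances from the `len(reps)` representatives to the centers rather than from all 1000 points, which is the source of the savings in distance computations that the abstract refers to.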
