Seeding - the task of finding initial cluster centers - is critical in obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state-of-the-art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data generating distribution. We propose a simple yet fast seeding algorithm that produces *provably* good clusterings even *without assumptions* on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude. We validate our theoretical results in extensive experiments on a variety of real-world datasets.
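A minimal sketch of the MCMC-based seeding idea the abstract describes: after choosing the first center uniformly, each subsequent center is drawn by running a short Metropolis-Hastings chain over the data points, using an assumption-free proposal distribution that mixes distance-based and uniform sampling. Function names, parameter defaults, and details here are illustrative, not the paper's reference implementation:

```python
import numpy as np

def mcmc_seeding(X, k, m=100, rng=None):
    """Sketch of MCMC-approximated k-means++ seeding.

    X   : (n, d) data matrix
    k   : number of cluster centers to pick
    m   : Markov chain length per center (quality/cost trade-off)
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # First center: uniform at random, as in k-means++.
    c0 = rng.integers(n)
    centers = [c0]

    # Proposal distribution: half distance-to-first-center, half uniform.
    # The uniform term is what removes assumptions on the data distribution.
    d2 = np.sum((X - X[c0]) ** 2, axis=1)
    q = 0.5 * d2 / d2.sum() + 0.5 / n

    # Squared distance of every point to its nearest chosen center.
    min_d2 = d2.copy()

    for _ in range(k - 1):
        # Initialize the chain with one sample from the proposal.
        x = rng.choice(n, p=q)
        dx = min_d2[x]
        for _ in range(m - 1):
            y = rng.choice(n, p=q)
            dy = min_d2[y]
            # Metropolis-Hastings acceptance targeting the
            # D^2-sampling distribution of k-means++.
            denom = dx * q[y]
            accept = 1.0 if denom <= 0 else min(1.0, (dy * q[x]) / denom)
            if rng.random() < accept:
                x, dx = y, dy
        centers.append(x)
        min_d2 = np.minimum(min_d2, np.sum((X - X[x]) ** 2, axis=1))

    return X[centers]
```

Each of the k-1 chains touches only m points, so the total cost is O(k * m * d) after one pass to build the proposal, instead of the k full passes k-means++ requires.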
Fast and Provably Good Seedings for k-Means
O. Bachem, M. Lucic, S. H. Hassani, A. Krause
In Proc. Neural Information Processing Systems (NeurIPS), 2016
Oral presentation
Bibtex Entry:
@inproceedings{bachem2016fast,
	author = {Olivier Bachem and Mario Lucic and S. Hamed Hassani and Andreas Krause},
	booktitle = {Proc. Neural Information Processing Systems (NeurIPS)},
	month = {December},
	title = {Fast and Provably Good Seedings for k-Means},
	year = {2016}}