Large-scale Machine Learning

Over the last several years, the world has witnessed the emergence of datasets of unprecedented scale across scientific disciplines. The sheer volume of such data presents new computational challenges: diverse, feature-rich, unstructured, and often high-resolution datasets do not lend themselves to effective large-scale inference. In particular, existing algorithms with superlinear complexity are computationally infeasible on millions or even billions of data points.

Coresets

The idea behind coresets is that a massive dataset can be represented by a small subset of representative data points while retaining theoretical guarantees on approximation quality: for example, the cost of any candidate solution evaluated on the coreset is within a (1 ± ε) factor of its cost on the full dataset. It is hence possible to obtain provably "competitive" solutions by training models on such coresets. Coreset sizes are usually sublinear in the number of original observations, if not independent of it altogether. This enables substantially faster approximate inference and makes even algorithms with superlinear complexity practical, since they only ever operate on the small coreset.

Submodular Maximization at Scale

A central challenge in large-scale machine learning is data summarization: extracting a small subset that is relevant, diverse, and representative of the whole. Many objective functions quantifying these notions satisfy submodularity, a natural diminishing-returns property: adding an element to a smaller summary helps more than adding it to a larger one. We pursue novel algorithms for large-scale submodular maximization, for example using randomization, streaming, and distributed computation; a sketch of one such randomized scheme follows below.

Publications

2018
  • Scalable k-Means Clustering via Lightweight Coresets
  • O. Bachem, M. Lucic, A. Krause
  • In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018
  • [bibtex] [abstract] [pdf]
  • One-Shot Coresets: The Case of k-Clustering
  • O. Bachem, M. Lucic, S. Lattanzi
  • In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018
  • [bibtex] [abstract] [pdf]
2017
  • Distributed and Provably Good Seedings for k-Means in Constant Rounds
  • O. Bachem, M. Lucic, A. Krause
  • In Proc. International Conference on Machine Learning (ICML), 2017
  • [bibtex] [abstract] [pdf]
  • Uniform Deviation Bounds for k-Means Clustering
  • O. Bachem, M. Lucic, S. H. Hassani, A. Krause
  • In Proc. International Conference on Machine Learning (ICML), 2017
  • [bibtex] [abstract] [pdf]
2016
  • Fast and Provably Good Seedings for k-Means
  • O. Bachem, M. Lucic, S. H. Hassani, A. Krause
  • In Proc. Neural Information Processing Systems (NeurIPS), 2016
  • Oral presentation
  • [bibtex] [abstract] [pdf] [code] [video]
  • Horizontally Scalable Submodular Maximization
  • M. Lucic, O. Bachem, M. Zadimoghaddam, A. Krause
  • In Proc. International Conference on Machine Learning (ICML), 2016
  • [bibtex] [abstract] [pdf] [long] [video]
  • Linear-Time Outlier Detection via Sensitivity
  • M. Lucic, O. Bachem, A. Krause
  • In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2016
  • [bibtex] [abstract] [pdf]
  • Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures
  • M. Lucic, O. Bachem, A. Krause
  • In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2016
  • [bibtex] [abstract] [pdf]
  • Approximate K-Means++ in Sublinear Time
  • O. Bachem, M. Lucic, S. H. Hassani, A. Krause
  • In Proc. Conference on Artificial Intelligence (AAAI), 2016
  • [bibtex] [abstract] [pdf]
  • Distributed Submodular Maximization
  • B. Mirzasoleiman, A. Karbasi, R. Sarkar, A. Krause
  • In Journal of Machine Learning Research (JMLR), 2016
  • [bibtex] [abstract] [pdf]
2015
  • Distributed Submodular Cover: Succinctly Summarizing Massive Data
  • B. Mirzasoleiman, A. Karbasi, A. Badanidiyuru, A. Krause
  • In Proc. Neural Information Processing Systems (NeurIPS), 2015
  • Spotlight presentation
  • [bibtex] [abstract] [pdf] [long]
  • Coresets for Nonparametric Estimation – the Case of DP-Means
  • O. Bachem, M. Lucic, A. Krause
  • In Proc. International Conference on Machine Learning (ICML), 2015
  • [bibtex] [abstract] [pdf] [long]
  • Tradeoffs for Space, Time, Data and Risk in Unsupervised Learning
  • M. Lucic, M. I. Ohannessian, A. Karbasi, A. Krause
  • In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2015
  • Best Student Paper Award
  • [bibtex] [abstract] [pdf] [long]
  • Lazier than Lazy Greedy
  • B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrak, A. Krause
  • In Proc. Conference on Artificial Intelligence (AAAI), 2015
  • [bibtex] [abstract] [pdf]
2014
  • Fast and Robust Least Squares Estimation in Corrupted Linear Models
  • B. McWilliams, G. Krummenacher, M. Lucic, J. Buhmann
  • In Proc. Neural Information Processing Systems (NeurIPS), 2014
  • Spotlight presentation
  • [bibtex] [abstract] [pdf] [long]
  • Streaming Submodular Optimization: Massive Data Summarization on the Fly
  • A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, A. Krause
  • In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014
  • [bibtex] [abstract] [pdf]
2013
  • Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
  • B. Mirzasoleiman, A. Karbasi, R. Sarkar, A. Krause
  • In Proc. Neural Information Processing Systems (NeurIPS), 2013
  • [bibtex] [abstract] [pdf] [long]

Learning & Adaptive Systems Group | Institute for Machine Learning | © 2025 ETH Zurich