Large-scale Machine Learning
Over the last several years, the world has witnessed the emergence of datasets of unprecedented scale across different scientific disciplines. The large volume of such datasets presents new computational challenges, as the diverse, feature-rich, unstructured and usually high-resolution data does not allow for effective data-intensive inference. In particular, existing algorithms with superlinear complexity are computationally infeasible in the context of millions or even billions of data points.
Coresets
The idea behind coresets is that massive datasets can be represented by a small subset of representative data points while retaining theoretical guarantees on the approximation quality. It is hence possible to obtain provably “competitive” solutions by training models on such coresets. Coreset sizes are usually sublinear in the number of original observations, if not independent of it. This enables substantially faster approximate inference and makes algorithms with superlinear complexity usable again, since they only need to run on the sublinear (or even constant-size) coreset.
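To make the idea concrete, here is a minimal sketch of one common coreset construction: importance sampling, where points are sampled with probability related to their distance from the dataset mean and reweighted so that weighted sums remain unbiased estimates. The function name and the specific 50/50 mixture of uniform and distance-based probabilities are illustrative assumptions, not the authors' specific method.

```python
import numpy as np

def sampled_coreset(X, m, seed=None):
    """Illustrative importance-sampling coreset for an (n x d) dataset X.

    Points far from the dataset mean are sampled with higher probability;
    inverse-probability weights keep weighted estimates unbiased.
    Returns m sampled points and their weights.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Squared distance of each point from the dataset mean.
    dists = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    # Mix uniform and distance-proportional sampling probabilities
    # (an assumed 50/50 mixture, for robustness when dists are skewed).
    q = 0.5 / n + 0.5 * dists / dists.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

A downstream algorithm with superlinear complexity would then be run on the m weighted points instead of all n originals.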
Submodular Maximization at Scale
A central challenge in large-scale machine learning is data summarization: extracting a small, relevant yet diverse, representative subset. Many objective functions quantifying these notions satisfy submodularity: adding an element to a smaller summary helps more than adding it to a larger summary. We pursue novel algorithms for submodular optimization in light of big data, for example, using randomization, streaming and distributed computation.
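The classic baseline for this setting is the greedy algorithm, which repeatedly adds the element with the largest marginal gain and achieves a (1 - 1/e)-approximation for monotone submodular functions under a cardinality constraint. The sketch below uses set coverage, a standard submodular objective, purely as an illustration; it is not one of the large-scale variants mentioned above.

```python
def greedy_max_cover(sets, k):
    """Greedy maximization of coverage (a monotone submodular function)
    under a cardinality constraint: pick at most k of the given sets.

    Each step adds the set with the largest marginal gain, i.e. the
    most elements not yet covered.
    """
    covered, chosen = set(), []
    for _ in range(k):
        # Marginal gain of each unchosen set; chosen sets score -1.
        best = max(range(len(sets)),
                   key=lambda i: -1 if i in chosen else len(sets[i] - covered))
        if len(sets[best] - covered) == 0:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

The randomized, streaming and distributed algorithms pursued here replace this sequential scan over all elements, which becomes a bottleneck at massive scale.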