Large-scale Machine Learning
Over the last several years, the world has witnessed the emergence of datasets of unprecedented scale across different scientific disciplines. The large volume of such datasets presents new computational challenges, as the diverse, feature-rich, unstructured and usually high-resolution data does not allow for effective data-intensive inference. In particular, existing algorithms with superlinear complexity are computationally infeasible in the context of millions or even billions of data points.
Coresets
The idea behind coresets is that massive datasets can be represented by a small subset of representative data points while retaining theoretical guarantees on the approximation quality. It is hence possible to obtain provably “competitive” solutions by training models on such coresets. Coreset sizes are usually sublinear in the number of original observations, if not independent of it. This enables substantially faster approximate inference and makes algorithms with superlinear complexity usable again, since they only need to run on the sublinear (or even constant-size) coreset.
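To make the idea concrete, here is a minimal sketch of one common coreset construction: importance sampling, where points are sampled with probability related to their distance from the dataset mean and reweighted so that weighted sums remain unbiased estimates. The function name and the specific 50/50 mixture of uniform and distance-based probabilities are illustrative assumptions, not the authors' specific method.

```python
import numpy as np

def sampled_coreset(X, m, seed=None):
    """Illustrative importance-sampling coreset for an (n x d) dataset X.

    Points far from the dataset mean are sampled with higher probability;
    inverse-probability weights keep weighted estimates unbiased.
    Returns m sampled points and their weights.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Squared distance of each point from the dataset mean.
    dists = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    # Mix uniform and distance-proportional sampling probabilities
    # (an assumed 50/50 mixture, for robustness when dists are skewed).
    q = 0.5 / n + 0.5 * dists / dists.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

A downstream algorithm with superlinear complexity would then be run on the m weighted points instead of all n originals.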
Submodular Maximization at Scale
A central challenge in large-scale machine learning is data summarization: extracting a small, relevant yet diverse, representative subset. Many objective functions quantifying these notions satisfy submodularity: adding an element to a smaller summary helps more than adding it to a larger summary. We pursue novel algorithms for submodular optimization in light of big data, for example, using randomization, streaming and distributed computation.
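The classic baseline for this setting is the greedy algorithm, which repeatedly adds the element with the largest marginal gain and achieves a (1 - 1/e)-approximation for monotone submodular functions under a cardinality constraint. The sketch below uses set coverage, a standard submodular objective, purely as an illustration; it is not one of the large-scale variants mentioned above.

```python
def greedy_max_cover(sets, k):
    """Greedy maximization of coverage (a monotone submodular function)
    under a cardinality constraint: pick at most k of the given sets.

    Each step adds the set with the largest marginal gain, i.e. the
    most elements not yet covered.
    """
    covered, chosen = set(), []
    for _ in range(k):
        # Marginal gain of each unchosen set; chosen sets score -1.
        best = max(range(len(sets)),
                   key=lambda i: -1 if i in chosen else len(sets[i] - covered))
        if len(sets[best] - covered) == 0:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

The randomized, streaming and distributed algorithms pursued here replace this sequential scan over all elements, which becomes a bottleneck at massive scale.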