Beyond Convexity: Submodularity in Machine Learning

Description

Convex optimization has become a main workhorse for many machine learning algorithms during the past ten years. When minimizing a convex loss function for, e.g., training a Support Vector Machine, we can rest assured to efficiently find an optimal solution, even for large problems. In recent years, another fundamental problem structure, which has similar beneficial properties, has emerged as very useful in a variety of machine learning applications: Submodularity is an intuitive diminishing returns property, stating that adding an element to a smaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows one to efficiently find provably (near-)optimal solutions.
In this tutorial, we will give an introduction to the concept of submodularity, discuss algorithms for optimizing submodular functions and – as the main focus – illustrate their usefulness in solving difficult machine learning problems, such as active learning [6,11] and sparse experimental design [23, 14, 13, 3], informative path planning [2, 24, 17], structure learning [19, 1], clustering [20], influence maximization [9] and ranking [15].

Tutorial Video

(recorded Oct 17 2008 at Carnegie Mellon University)


Part I: Minimizing submodular functions	Part II: Maximizing submodular functions

Materials

Extended tutorial slides, updated July 6 2008 [pdf, ppt]
MATLAB Toolbox for submodular function optimization [link]

Outline

Introduction

Motivating Applications
Why should the Machine Learning community care about submodularity?

Submodular set functions

Definitions
Intuition: Why are submodular functions useful
Examples of submodular functions (entropy [8] and mutual information [10, 14], influence in graphs [9], etc.)
Examples of functions which are not submodular
Operations preserving submodularity
Relationship between submodularity and convexity [16]

Minimizing submodular functions

Overview about known results for minimization [22, 7]
Queyranne’s algorithm for minimizing symmetric submodular functions [22], applications to factorizing distributions and clustering [20]
Learning structure of graphical models [19, 1]
The submodular-supermodular procedure with applications to feature selection and structure learning [18]

Maximizing submodular functions

The Greedy algorithm [21,32]
Lazy evaluations and scaling up to large data sets [28]
Applications of maximizing submodular functions
Constrained maximization of submodular functions [25]
Robust maximization of submodular functions (with applications to Gaussian Process regression) [13]

Conclusions (5 min)

Current research directions / Open questions
Other pointers (submodularity in games / allocation problems [4,35] / online algorithms [5])

Who should attend

Since submodularity arises in so many different areas of machine learning, this tutorial should be of theoretical and practical interest to a large part of the machine learning community. The tutorial will not require prior knowledge beyond the basic concepts covered in an introductory machine learning class.

Presenters

Andreas Krause is a Ph.D. Candidate at the Computer Science Department of Carnegie Mellon University. He is a recipient of a Microsoft Research Graduate Fellowship, and his research on sensor placement and information acquisition received awards at several conferences (KDD 2007, IPSN 2006, ICML 2005 and UAI 2005). He obtained his Diplom in Computer Science and Mathematics from the Technische Universität München, where his research received the NRW Undergraduate Science Award.

Carlos Guestrin is an assistant professor in the Machine Learning and in the Computer Science Departments at Carnegie Mellon University. Previously, he was a senior researcher at the Intel Research Lab in Berkeley. Carlos received his MSc and PhD in Computer Science from Stanford University in 2000 and 2003, respectively, and a Mechatronics Engineer degree from the Polytechnic School of the University of Sao Paulo, Brazil, in 1998. Carlos' work received awards at a number of conferences and a journal: KDD 2007, IPSN 2005 and 2006, VLDB 2004, NIPS 2003 and 2007, UAI 2005, ICML 2005, and JAIR in 2007. He is also a recipient of the ONR Young Investigator Award, the NSF Career Award, the Alfred P. Sloan Fellowship, the IBM Faculty Fellowship, the Siebel Scholarship and the Stanford Centennial Teaching Assistant Award. Carlos is currently a member of the Information Sciences and Technology (ISAT) advisory group for DARPA.

References

The following references are used in the tutorial. A short high level summary is given, without any claim of completeness.

[1] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, December 2007 [pdf]

Uses submodular function optimization to speed up structure learning.

[2] C. Chekuri and M. Pal. A recursive greedy algorithm for walks in directed graphs. In Annual Symposium on Foundation of Computer Science (FOCS), pages 245–253, 2005 [link]

Slightly superpolynomial approximation algorithm for submodular path planning (orienteering).

[3] A. Das and D. Kempe. Algorithms for subset selection in linear regression. In STOC, 2008. [pdf]

Variance reduction in Gaussian models is submodular under certain conditions on the covariance.

[4] U. Feige. On maximizing welfare when utility functions are subadditive. In STOC, 2006 [pdf]

Generalization of the submodular welfare problem to additive set functions.

[5] D. Golovin and M. Streeter. Online algorithms for maximizing submodular set functions. In Technical Report CMU-CS-07-171. 2007. [pdf]

Online algorithm for maximizing submodular functions, with application to scheduling.

[6] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In ICML, 2006. [link]

Defines a monotonic submodular criterion for batch-mode active learning of kernelized logistic regression.

[7] S. Iwata, L. Fleischer, and S. Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM, 48(4):761–777, 2001.

Strongly polynomial algorithm for minimizing arbitrary submodular functions.

[8] A. K. Kelmans and B. N. Kimelfeld. Multiplicative submodularity of a matrix’s principal minor as a function of the set of its rows and some combinatorial applications. Discrete Mathematics, 44(1):113–116, 1980.

Proves that entropy is a submodular function.

[9] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003 [pdf]

Maximizing influence (viral marketing) in social networks is submodular.

[10] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005 [pdf]

Information gain in Naive Bayes models is submodular. Also gives (1-1/e) hardness of approximation result for information gain in such models.

[11] A. Krause and C. Guestrin. Nonmyopic active learning of gaussian processes: An exploration—exploitation approach. In ICML, 2007 [pdf]

Using submodularity to analyze sequential experimental design in Gaussian Processes with uncertain kernel parameters.

[12] A. Krause, C. Guestrin, A. Gupta, and J. Kleinberg. Near-optimal sensor placements: Maximizing information while minimizing communication cost. In Proceedings of the Fifth International Symposium on Information Processing in Sensor Networks (IPSN), 2006 [pdf]

Optimizing monotonic submodular functions subject to communication constraints.

[13] A. Krause, B. McMahan, C. Guestrin, and A. Gupta. Selecting observations against adversarial objectives. In NIPS, 2007 [pdf]

Maximizing the minimum over a set of monotonic submodular functions, with applications to robust experimental design.

[14] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. In JMLR, 2008.

Proves approximate monotonicity of mutual information for experimental design in Gaussian Processes.

[15] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007 [pdf]

Proves monotonic submodularity for outbreak detection in networks, with applications to sensor placement and selecting informative blogs.

[16] L. Lovasz. Submodular functions and convexity. Mathematical Programming - State of the Art, pages 235–257, 1983.

Discusses connections between submodular and convex functions. Introduces Lovasz extension.

[17] A. Meliou, A. Krause, C. Guestrin, and J. Hellerstein. Nonmyopic informative path planning in spatio-temporal models. In AAAI, 2007.

Spatio-temporal submodular path planning.

[18] M. Narasimhan and J. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In Advances in Neural Information Processing Systems (NIPS) 19, 2006 [pdf]

Maximizing the difference of two submodular functions.

[19] M. Narasimhan and J. Bilmes. Pac-learning bounded tree-width graphical models. In Uncertainty in Artificial Intelligence, 2004 [pdf]

Uses symmetric submodular function minimization to find separators, which allows efficient structure learning.

[20] M. Narasimhan, N. Jojic, and J. Bilmes. Q-clustering. In NIPS, 2005 [pdf].

Shows how Queyranne's algorithm can be used for (near-)optimal clustering.

[21] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14:265–294, 1978.

(1-1/e) guarantee for the greedy algorithm for maximizing submodular functions.

[22] M. Queyranne. A combinatorial algorithm for minimizing symmetric submodular functions. In SODA, 1995.

Strongly polynomial algorithm for minimizing symmetric submodular functions.

[23] T. G. Robertazzi and S. C. Schwartz. An accelerated sequential algorithm for producing D-optimal designs. SIAM Journal of Scientific and Statistical Computing, 10(2):341–358, March 1989.

Uses lazy evaluations to construct D-optimal designs.

[24] A. Singh, A. Krause, C. Guestrin, W. Kaiser, and M. Batalin. Efficient planning of informative paths for multiple robots. In IJCAI, 2007 [pdf]

An efficient approximation algorithm for submodular path planning (orienteering), with extensions for multiple robots.

[25] M. Sviridenko. A note on maximizing a submodular set function subject to knapsack constraint. Operations Research Letters, 32:41–43, 2004 [link]

Shows that (1-1/e) guarantee can be achieved for budgeted maximization of submodular functions.

[26] L. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2:385–393, 1982.

Proves guarantees for the greedy algorithm for submodular coverage.

[27] M. Goemans, and V. Ramakrishnan. Minimizing submodular functions over families of sets. Combinatorica, 15(4): 499-513, 1995

Guarantees for constrained submodular minimization.

[28] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, LNCS 234-243, 1978 [link]

Introduces the lazy greedy algorithm and online bounds for maximizing submodular functions.

[29] M. Groetschel, L. Lovasz, and A. Schrijver. The ellipsoid method and its consequences in combinatorial optimization. In Combinatorica, 1:169-197, 1981

Shows how the ellipsoid algorithm can minimize submodular functions in polynomial time.

[30] M. Queyranne. A Gomory-Hu tree for symmetric submodular functions, 1983

Proves the existence of Gomory-Hu trees for arbitrary symmetric submodular functions.

[31] L. Zhao, H. Nagamochi, and T. Ibaraki. Greedy splitting algorithms for approximating multiway partition problems, Mathematical Programming 102(1): 167-183, 2005 [link]

Analyzes greedy splitting for submodular clustering.

[32] M. Fisher, G. Nemhauser and L. Wolsey. An analysis of approximations for maximizing submodular set functions II, Math. Programming Study 8: 73–87, 1978

Guarantees for maximizing submodular functions over the intersection of p matroids.

[33] S. Fujishige, T. Hayashi, and S. Isotani. The Minimum-Norm-Point Algorithm Applied to Submodular Function Minimization and Linear Programming. Manuscript September 2006 [pdf]

Empirical study on the performance of the minimum norm algorithm.

[34] S. Fujishige. Submodular functions and optimization. Annals of Discrete Mathematics, Elsevier 2nd Edition 2005, 1st Edition 1991

Monograph on submodular functions.

[35] J. Vondrak. Optimal approximation for the Submodular Welfare Problem in the value oracle model, STOC 2008 [pdf]

Proves (1-1/e) guarantee for a continuous greedy algorithm for maximizing submodular functions subject to arbitrary matroid constraints.

[36] J. Edmonds. Submodular functions, matroids, and certain polyhedra. Proceedings of the Calgary International Conference on Combinatorial Structures and Their Applications (R. Guy, H. Hanani, N. Sauer and J. Schoenheim, eds., Gordon and Breach, New York, 1970), pp. 69–87; also in: Combinatorial Optimization—Eureka, You Shrink! (M. Juenger, G. Reinelt, and G. Rinaldi, eds., Lecture Notes in Computer Science 2570, Springer, Berlin, 2003), pp. 11–26.

Studies submodular functions, proves duality results, ...

[37] A. Krause, C. Guestrin. A Note on the Budgeted Maximization on Submodular Functions. Technical Report CMU-CALD-05-103, 2005 [pdf]

Guarantees for submodular function maximization if function is perturbed by noise / only approximately submodular.

[38] M. Seeger. Greedy Forward Selection in the Informative Vector Machine. Manuscript 2004 [pdf]

Proves that the greedy algorithm used in the IVM optimizes a submodular function.

Last updated: November 13 2008