Data Mining: Learning from Large Data Sets
Overview
Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, we will study principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets. The course will both cover theoretical foundations and practical applications.
Topics covered
- Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
- Fast nearest neighbor methods (Shingling, locality sensitive hashing)
- Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
- Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
- Active learning (uncertainty sampling, pool-based methods, label complexity)
- Dimension reduction (random projections, nonlinear methods)
- Data streams (Sketches, coresets, applications to online clustering)
- Recommender systems
News
Details
- VVZ Information: See here.
- Recitations:
- Thu 15-16 in NO C 6. Last names starting with A-M
- Fri 14-15 in NO C 6. Last names starting with N-Z
- Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.
Homeworks
- Homework 1 [pdf]. Supplemental material [zip]
- Homework 2 [pdf]. Supplemental material [zip]
- Homework 3 [pdf]
- Homework 4 [pdf]
- Homework 5 [pdf]. Supplemental material [py]
Solutions
- Homework 1 [pdf]. Implementation: [zip]
- Homework 2 [pdf]. Implementation: [zip]
- Homework 3 [pdf]
- Homework 4 [pdf]. Implementation: [py]
- Homework 5 [pdf]. Implementation: [py]
Lecture Notes
- Feb 21: Introduction [pdf]
- March 6: Min-hashing [pdf]
- March 13: Locality sensitive hashing [pdf]
- March 27: SVMs; online convex programming [pdf]
- April 3: Parallel learning; other loss functions and regularizers [pdf]
- April 17: Regression, model selection [pdf]
- April 24: Active Learning [pdf]
- May 8: Large scale unsupervised learning (Online k-means, coresets) [pdf]
- May 15: Large scale unsupervised learning (Online EM, coresets,
anomaly detection) [pdf]
- May 22: Exploration--exploitation tradeoffs (k-armed bandits,
upper confidence sampling) [pdf]
- May 29: Exploration--exploitation tradeoffs in recommender systems; submodular functions [pdf]
Recitations
- March 8: Nearest neighbor search [pdf]
- March 15: Probability & linear algebra [pdf]
- March 22: LSH & HW1 [pdf]
- March 29: Online SVM [pdf]
Old Exams
- Data Mining Exam, 2011 Spring [pdf]
Relevant Readings
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
- James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
- Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
- Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB) , 1999
- Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
- Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
- Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
- Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
- Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
- Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
- Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf
]
- Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for
k-Means Clustering Based on Weak Coresets. Proc. 23th Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
- Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf
]
- Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002.
[pdf]
- Niranjan Srinivas, Andreas Krause, Sham Kakade, Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proc. International Conference on Machine Learning (ICML), 2010
[pdf]
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article
recommendation. In WWW-10, Raleigh, NC, April, 2010.
[pdf]
- Lihong Li, Wei Chu, John Langford, Robert Schapire. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proc. WWW 2010 [pdf]
- Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
- Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
- Andreas Krause, Carlos Guestrin. Beyond Convexity - Submodularity in Machine Learning. Tutorial at ICML 2008. [pdf]
Related Courses