Data Mining: Learning from Large Data Sets

Overview

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, we will study principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets. The course will both cover theoretical foundations and practical applications.

Topics covered

Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
Fast nearest neighbor methods (Shingling, locality sensitive hashing)
Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
Active learning (uncertainty sampling, pool-based methods, label complexity)
Dimension reduction (random projections, nonlinear methods)
Data streams (Sketches, coresets, applications to online clustering)
Recommender systems

News

No class on March 20.

Details

VVZ Information: See here.
Recitations:
- Thu 15-16 in NO C 6. Last names starting with A-M
- Fri 14-15 in NO C 6. Last names starting with N-Z
Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.

Homeworks

Homework 1 [pdf]. Supplemental material [zip]
Homework 2 [pdf]. Supplemental material [zip]
Homework 3 [pdf]
Homework 4 [pdf]
Homework 5 [pdf]. Supplemental material [py]

Solutions

Homework 1 [pdf]. Implementation: [zip]
Homework 2 [pdf]. Implementation: [zip]
Homework 3 [pdf]
Homework 4 [pdf]. Implementation: [py]
Homework 5 [pdf]. Implementation: [py]

Lecture Notes

Feb 21: Introduction [pdf]
March 6: Min-hashing [pdf]
March 13: Locality sensitive hashing [pdf]
March 27: SVMs; online convex programming [pdf]
April 3: Parallel learning; other loss functions and regularizers [pdf]
April 17: Regression, model selection [pdf]
April 24: Active Learning [pdf]
May 8: Large scale unsupervised learning (Online k-means, coresets) [pdf]
May 15: Large scale unsupervised learning (Online EM, coresets, anomaly detection) [pdf]
May 22: Exploration--exploitation tradeoffs (k-armed bandits, upper confidence sampling) [pdf]
May 29: Exploration--exploitation tradeoffs in recommender systems; submodular functions [pdf]

Recitations

March 8: Nearest neighbor search [pdf]
March 15: Probability & linear algebra [pdf]
March 22: LSH & HW1 [pdf]
March 29: Online SVM [pdf]

Old Exams

Data Mining Exam, 2011 Spring [pdf]

Relevant Readings

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB) , 1999
Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf ]
Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for k-Means Clustering Based on Weak Coresets. Proc. 23th Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf ]
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002. [pdf]
Niranjan Srinivas, Andreas Krause, Sham Kakade, Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proc. International Conference on Machine Learning (ICML), 2010 [pdf]
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW-10, Raleigh, NC, April, 2010. [pdf]
Lihong Li, Wei Chu, John Langford, Robert Schapire. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proc. WWW 2010 [pdf]
Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
Andreas Krause, Carlos Guestrin. Beyond Convexity - Submodularity in Machine Learning. Tutorial at ICML 2008. [pdf]

Related Courses

CS345a: Data Mining at Stanford University
15-826: Multimedia Databases and Data Mining at Carnegie Mellon University
CS/CNS/EE 253: Advanced Topics in Machine Learning at Caltech