Data Mining: Learning from Large Data Sets

Overview

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, students will learn to apply, analyze and evaluate principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets.

Topics

Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
Fast nearest neighbor methods (Shingling, locality sensitive hashing)
Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
Active learning (uncertainty sampling, pool-based methods, label complexity)
Dimension reduction (random projections, nonlinear methods)
Data streams (Sketches, coresets, applications to online clustering)
Recommender systems

News

Please watch project website for updates. First part of project due on April 14.
First class on 19.2.

Details

VVZ Information: See here.
Recitations:
- Tue 13-14 in CAB G 61. Last names starting with A-L
- Fri 14-15 in NO C 6. Last names starting with M-Z
Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.

Homeworks

Self Assessment Questions [pdf]
Homework 1 [pdf]. Data set [zip]
Homework 2 [pdf]
Homework 3 [pdf]
Homework 4 [pdf]
Homework 5 [pdf]
Homework 6 [pdf]

Solutions

Homework 1 [zip]
Homework 2 [pdf]
Homework 3 [pdf]
Homework 4 [pdf]
Homework 5 [pdf]
Homework 6 [pdf]

Lecture Notes

February 19: Introduction [pdf]
February 26: Approximate Retrieval; Min-hashing [pdf]
March 5: Locality sensitive hashing [pdf]
March 12: SVMs; online convex programming [pdf]
March 19: (Parallel) stochastic gradient descent [pdf]
March 26: Feature selection via l1-regularization; multi-class/structured prediction [pdf]
April 9: Active Learning [pdf]
April 16: Large scale unsupervised learning (Online k-means, coresets) [pdf]
April 23: Large scale unsupervised learning (Online EM, coresets, anomaly detection) [pdf]
April 30: Exploration--exploitation tradeoffs (k-armed bandits, upper confidence sampling) [pdf]
May 7: Contextual bandits [pdf]
May 14: Submodular functions (properties, algorithms and applications) [pdf]
May 28: Recommending sets (structured prediction, online submodular optimization) [pdf]

Recitations

Feb 26: Hadoop tutorial [pdf]
Mar 5: LSH & NN [pdf]
Mar 12: SVM [pdf]
Mar 19: Project 1 - Approximate Retrieval
April 9: Online SVMs (HW3/Loss Functions/L1 Regularization)
April 16: Project 2 - Large-scale Classification [pdf]
April 23: Active Learning [pdf]
April 30: Unsupervised Learning
May 7: Exploration-Exploitation [pdf]
May 14: Project 3 - Recommender Systems [pdf]

Old Exams

Data Mining Exam, 2012 Spring [pdf]

Relevant Readings

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
Jure Leskovec, Eric Horvitz. Planetary-Scale Views on a Large Instant-Messaging Network. In Proceedings WWW 2008 [pdf]
Manuel Gomez Rodriguez, Jure Leskovec, Andreas Krause. Inferring Networks of Diffusion and Influence, In Proc. ACM Conference on Knowledge Discovery in Databases (KDD), 2010 [pdf]
James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB) , 1999
Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
John Duchi, Shai Shalev-Shwartz, Yoram Singer, Tushar Chandra. Efficient Projections onto the l1-Ball for Learning in High Dimensions. ICML 2008 [pdf]
Nathan Ratliff, J. Andrew (Drew) Bagnell, and Martin Zinkevich. (Online) Subgradient Methods for Structured Prediction. AISTATS 2007 [pdf]
Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for k-Means Clustering Based on Weak Coresets. Proc. 23th Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf ]
Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf ]
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002. [pdf]
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW-10, Raleigh, NC, April, 2010. [pdf]
Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
Matthew Streeter, Daniel Golovin. An Online Algorithm for Maximizing Submodular Functions. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2008 [pdf]
Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
Andreas Krause, Daniel Golovin. Submodular Function Maximization. Survey [pdf]
Yisong Yue, Carlos Guestrin. Linear Submodular Bandits and their Application to Diversified Retrieval. Neural Information Processing Systems (NIPS), December, 2011 [pdf]

Related Courses

CS345a: Data Mining at Stanford University
15-826: Multimedia Databases and Data Mining at Carnegie Mellon University
CS/CNS/EE 253: Advanced Topics in Machine Learning at Caltech