Data Mining: Learning from Large Data Sets


Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, students will learn to apply, analyze and evaluate principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets.


  • Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
  • Fast nearest neighbor methods (Shingling, locality sensitive hashing)
  • Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
  • Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
  • Active learning (uncertainty sampling, pool-based methods, label complexity)
  • Dimension reduction (random projections, nonlinear methods)
  • Data streams (Sketches, coresets, applications to online clustering)
  • Recommender systems


  • Please watch project website for updates. First part of project due on April 14.
  • First class on 19.2.


  • VVZ Information: See here.
  • Recitations:
    • Tue 13-14 in CAB G 61. Last names starting with A-L
    • Fri 14-15 in NO C 6. Last names starting with M-Z
  • Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.


  • Self Assessment Questions [pdf]
  • Homework 1 [pdf]. Data set [zip]
  • Homework 2 [pdf]
  • Homework 3 [pdf]
  • Homework 4 [pdf]
  • Homework 5 [pdf]
  • Homework 6 [pdf]


  • Homework 1 [zip]
  • Homework 2 [pdf]
  • Homework 3 [pdf]
  • Homework 4 [pdf]
  • Homework 5 [pdf]
  • Homework 6 [pdf]

Lecture Notes

  • February 19: Introduction [pdf]
  • February 26: Approximate Retrieval; Min-hashing [pdf]
  • March 5: Locality sensitive hashing [pdf]
  • March 12: SVMs; online convex programming [pdf]
  • March 19: (Parallel) stochastic gradient descent [pdf]
  • March 26: Feature selection via l1-regularization; multi-class/structured prediction [pdf]
  • April 9: Active Learning [pdf]
  • April 16: Large scale unsupervised learning (Online k-means, coresets) [pdf]
  • April 23: Large scale unsupervised learning (Online EM, coresets, anomaly detection) [pdf]
  • April 30: Exploration--exploitation tradeoffs (k-armed bandits, upper confidence sampling) [pdf]
  • May 7: Contextual bandits [pdf]
  • May 14: Submodular functions (properties, algorithms and applications) [pdf]
  • May 28: Recommending sets (structured prediction, online submodular optimization) [pdf]


  • Feb 26: Hadoop tutorial [pdf]
  • Mar 5: LSH & NN [pdf]
  • Mar 12: SVM [pdf]
  • Mar 19: Project 1 - Approximate Retrieval
  • April 9: Online SVMs (HW3/Loss Functions/L1 Regularization)
  • April 16: Project 2 - Large-scale Classification [pdf]
  • April 23: Active Learning [pdf]
  • April 30: Unsupervised Learning
  • May 7: Exploration-Exploitation [pdf]
  • May 14: Project 3 - Recommender Systems [pdf]

Old Exams

  • Data Mining Exam, 2012 Spring [pdf]

Relevant Readings

  • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
  • Jure Leskovec, Eric Horvitz. Planetary-Scale Views on a Large Instant-Messaging Network. In Proceedings WWW 2008 [pdf]
  • Manuel Gomez Rodriguez, Jure Leskovec, Andreas Krause. Inferring Networks of Diffusion and Influence, In Proc. ACM Conference on Knowledge Discovery in Databases (KDD), 2010 [pdf]
  • James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
  • Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB) , 1999
  • Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
  • Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
  • Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
  • John Duchi, Shai Shalev-Shwartz, Yoram Singer, Tushar Chandra. Efficient Projections onto the l1-Ball for Learning in High Dimensions. ICML 2008 [pdf]
  • Nathan Ratliff, J. Andrew (Drew) Bagnell, and Martin Zinkevich. (Online) Subgradient Methods for Structured Prediction. AISTATS 2007 [pdf]
  • Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
  • Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for k-Means Clustering Based on Weak Coresets. Proc. 23th Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
  • Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
  • Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf ]
  • Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf ]
  • Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002. [pdf]
  • Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW-10, Raleigh, NC, April, 2010. [pdf]
  • Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
  • Matthew Streeter, Daniel Golovin. An Online Algorithm for Maximizing Submodular Functions. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2008 [pdf]
  • Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
  • Andreas Krause, Daniel Golovin. Submodular Function Maximization. Survey [pdf]
  • Yisong Yue, Carlos Guestrin. Linear Submodular Bandits and their Application to Diversified Retrieval. Neural Information Processing Systems (NIPS), December, 2011 [pdf]

Related Courses