Data Mining: Learning from Large Data Sets
Overview
Many scientific and commercial applications require us to obtain
insights from massive, high-dimensional data sets. In this
graduate-level course, students will learn to apply, analyze and evaluate principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets.
Topics
- Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
- Fast nearest neighbor methods (Shingling, locality sensitive hashing)
- Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
- Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
- Active learning (uncertainty sampling, pool-based methods, label complexity)
- Dimension reduction (random projections, nonlinear methods)
- Data streams (Sketches, coresets, applications to online clustering)
- Recommender systems
News
- Please watch project
website for updates. First part of project due on April 14.
- First class on 19.2.
Details
- VVZ Information: See here.
- Recitations:
- Tue 13-14 in CAB G 61. Last names starting with A-L
- Fri 14-15 in NO C 6. Last names starting with M-Z
- Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.
Homeworks
- Self Assessment Questions [pdf]
- Homework 1 [pdf]. Data set [zip]
- Homework 2 [pdf]
- Homework 3 [pdf]
- Homework 4 [pdf]
- Homework 5 [pdf]
- Homework 6 [pdf]
Solutions
- Homework 1 [zip]
- Homework 2 [pdf]
- Homework 3 [pdf]
- Homework 4 [pdf]
- Homework 5 [pdf]
- Homework 6 [pdf]
Lecture Notes
- February 19: Introduction [pdf]
- February 26: Approximate Retrieval; Min-hashing [pdf]
- March 5: Locality sensitive hashing [pdf]
- March 12: SVMs; online convex programming [pdf]
- March 19: (Parallel) stochastic gradient descent [pdf]
- March 26: Feature selection via l1-regularization; multi-class/structured prediction [pdf]
- April 9: Active Learning [pdf]
- April 16: Large scale unsupervised learning (Online k-means, coresets) [pdf]
- April 23: Large scale unsupervised learning (Online EM, coresets,
anomaly detection) [pdf]
- April 30: Exploration--exploitation tradeoffs (k-armed bandits,
upper confidence sampling) [pdf]
- May 7: Contextual bandits [pdf]
- May 14: Submodular functions (properties, algorithms and applications) [pdf]
- May 28: Recommending sets (structured prediction, online
submodular optimization) [pdf]
Recitations
- Feb 26: Hadoop tutorial [pdf]
- Mar 5: LSH & NN [pdf]
- Mar 12: SVM [pdf]
- Mar 19: Project 1 - Approximate Retrieval
- April 9: Online SVMs (HW3/Loss Functions/L1 Regularization)
- April 16: Project 2 - Large-scale Classification [pdf]
- April 23: Active Learning [pdf]
- April 30: Unsupervised Learning
- May 7: Exploration-Exploitation [pdf]
- May 14: Project 3 - Recommender Systems [pdf]
Old Exams
- Data Mining Exam, 2012 Spring [pdf]
Relevant Readings
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
- Jure Leskovec, Eric Horvitz. Planetary-Scale Views on a Large
Instant-Messaging Network. In Proceedings WWW 2008 [pdf]
- Manuel Gomez Rodriguez, Jure Leskovec, Andreas Krause. Inferring
Networks of Diffusion and Influence, In Proc. ACM Conference on
Knowledge Discovery in Databases (KDD), 2010 [pdf]
- James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
- Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
- Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search
in High Dimensions via Hashing [ps]. 25th
International Conference on Very Large Databases (VLDB) , 1999
- Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
- Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
- Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm
support vector machines. Advances in Neural Information Processing
Systems 2003 [html]
- John Duchi, Shai Shalev-Shwartz, Yoram Singer, Tushar
Chandra. Efficient Projections onto the l1-Ball for Learning in
High Dimensions. ICML 2008
[pdf]
- Nathan Ratliff, J. Andrew (Drew) Bagnell, and Martin
Zinkevich. (Online) Subgradient Methods for Structured
Prediction. AISTATS 2007 [pdf]
- Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
- Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
- Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for
k-Means Clustering Based on Weak Coresets. Proc. 23th Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
- Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
- Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf
]
- Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf
]
- Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002.
[pdf]
- Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article
recommendation. In WWW-10, Raleigh, NC, April, 2010.
[pdf]
- Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
- Matthew Streeter, Daniel Golovin. An Online Algorithm for
Maximizing Submodular Functions. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2008 [pdf]
- Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
- Andreas Krause, Daniel Golovin. Submodular Function
Maximization. Survey [pdf]
- Yisong Yue, Carlos Guestrin. Linear Submodular Bandits and their
Application to Diversified Retrieval. Neural Information
Processing Systems (NIPS), December, 2011 [pdf]
Related Courses