Data Mining: Learning from Large Data Sets

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, students will learn to apply, analyze and evaluate principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets. VVZ information is available here.

Syllabus

Date	Topics	Tutorials	Exercises	Solutions
Sep 17	Introduction; MapReduce	-	-	-
Sep 24	Approximate Retrieval; Min-hashing	[pdf]	[pdf][zip]	[pdf]
Oct 1	Locality sensitive hashing	-	-	-
Oct 8	SVMs; Online Convex Programming	[pdf]	[pdf]	[zip]
Oct 15	Stochastic Gradient Descent	-	-	-
Oct 22	Adapting to Geometry	-	[pdf][zip]	[pdf][zip]
Oct 29	Parallelization; Random Features	-	-	-
Nov 5	Random Features; Active Learning	-	[pdf]	[pdf]
Nov 12	Large-scale unsupervised learning: k-means	-	-	-
Nov 19	Online k-means; Coresets	[ipynb]	[pdf]	[pdf]
Nov 26	Multi-armed Bandits; LinUCB		-	-
Dec 3	Contextual bandits; evaluation	[ipynb] [html]	[pdf]	[pdf]
Dec 17	Recommending lists: Submodular functions / bandits	-	-	-

Contact

Instructor	Prof. Andreas Krause
Head TA	Olivier Bachem
Assistants	Josip Djolonga, Alkis Gotovos, Baharan Mirzasoleiman, Mario Lučić, Hamed Hassani
Mailing List	If you have any questions please send them to dm2015@lists.inf.ethz.ch from your ethz.ch address.

Lectures

Thu

13-15

HG D 1.1

Tutorials

Thu 15-17	HG D 3.2
Thu 15-17	HG D 3.2

News

Exam review session: Wednesday, January 20th, 2016, 14.00-15.30, CAB G 61. We will answer questions that are sent by email to dm2015@lists.inf.ethz.ch until Sunday, January 17th, 2016.
On the project server, we have released preliminary project grades for all groups.
We have posted two old exams below to help you prepare for the exam.
Web site updated: If you can’t download some of the materials, please clear your browser cache.
First class on 17.9. First tutorial/recitation on 24.9.
To obtain the password please click here and login using your nethz account or login using the ETH VPN client and click here.

Topics

Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
Fast nearest neighbor methods (Shingling, locality sensitive hashing)
Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
Active learning (uncertainty sampling, pool-based methods, label complexity)
Dimension reduction (random projections, nonlinear methods)
Data streams (Sketches, coresets, applications to online clustering)
Recommender systems

Exercises

The exercise problems will include theoretical and programming problems. Please note that it is not mandatory to submit solutions. We will publish exercise solutions after one week. If you choose to submit:

Send a soft copy of the exercise from your ethz.ch address to dm2015@lists.inf.ethz.ch. This can be latex, but also a simple scan or even a picture of a hand-written solution.
Please do not submit hard copies of your solutions.

Project

Part of the coursework will be a project, carried out in groups of up to 3 students. The goal of this project is to get hands-on experience in machine learning tasks. The project grade will constitute 30% of the total grade. More details on the project will be given in the tutorials. Register your team here. The project details are displayed after you login here.

Exam

The mode of examination is written, 120 minutes length. The language of examination is English. As written aids, you can bring two A4 pages (i.e. one A4 sheet of paper), either handwritten or 11 point minimum font size. The written exam will constitute 70% of the total grade.

Old Exam

You may download two previous exams here: [2013] [2014].

Resources

Text Book

J. Leskovec, A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.

References

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
Jure Leskovec, Eric Horvitz. Planetary-Scale Views on a Large Instant-Messaging Network. In Proceedings WWW 2008 [pdf]
Manuel Gomez Rodriguez, Jure Leskovec, Andreas Krause. Inferring Networks of Diffusion and Influence, In Proc. ACM Conference on Knowledge Discovery in Databases (KDD), 2010 [pdf]
James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is “Nearest Neighbor” Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB) , 1999
Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
Elad Hazan, Amit Agarwal, Satyen Kale. Logarithmic Regret Algorithms for Online Convex Optimization. Machine Learning Journal Volume 69 , Issue 2-3 Pages: 169 – 192 [pdf]
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12 (2011): 2121-2159. [pdf]
Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Advances in Neural Information Processing Systems 24, 2011 [pdf]
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012. [pdf]
Ali Rahimi, Ben Recht. Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems 20, 2007 [pdf]
Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou. Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. Advances in Neural Information Processing Systems 25, 2012 [pdf]
Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
John Duchi, Shai Shalev-Shwartz, Yoram Singer, Tushar Chandra. Efficient Projections onto the l1-Ball for Learning in High Dimensions. ICML 2008 [pdf]
Nathan Ratliff, J. Andrew (Drew) Bagnell, and Martin Zinkevich. (Online) Subgradient Methods for Structured Prediction. AISTATS 2007 [pdf]
Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]
Dan Feldman, Morteza Monemizadeh, Christian Sohler. A PTAS for k-Means Clustering Based on Weak Coresets. Proc. 23th Annu. ACM Symposium on Computational Geometry (SoCG) 2007 [pdf]
Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Chapter 9
Percy Liang, Dan Klein. Online EM for Unsupervised Models. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2009 [pdf ]
Dan Feldman, Matthew Faulkner, Andreas Krause. Scalable Training of Mixture Models via Coresets. In Proc. Neural Information Processing Systems, 2011 [pdf ]
Mario Lucic, Olivier Bachem, Andreas Krause. Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures, 2015 [pdf]
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 2002. [pdf]
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW-10, Raleigh, NC, April, 2010. [pdf]
Khalid El-Arini, Gaurav Veda, Dafna Shahaf and Carlos Guestrin. Turning Down the Noise in the Blogosphere. In Proc. of 15th International Conference on Knowledge Discovery and Data Mining (KDD 2009) [pdf]
Andreas Krause, Daniel Golovin. Submodular Function Maximization. Survey [pdf]
Matthew Streeter, Daniel Golovin. An Online Algorithm for Maximizing Submodular Functions. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2008 [pdf]
Matthew Streeter, Daniel Golovin, Andreas Krause. Online Learning of Assignments. In Proc. 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 [pdf]
Yisong Yue, Carlos Guestrin. Linear Submodular Bandits and their Application to Diversified Retrieval. Neural Information Processing Systems (NIPS), December, 2011 [pdf]