OPIM410/672: Decision Support Systems
MW 1:30-3:00P, JMHH F65
Instructor: Shawndra Hill, JMHH 567, office hours Mondays 3:30-4:30, Fridays 2-5, or by appointment.
Teaching Assistant: Shachi Pandey [shachi.pandey@gmail.com]
Text: Data Mining Techniques, Second Edition, by Michael Berry and Gordon Linoff, Wiley, 2004, ISBN: 0-471-47064-3
Webcafe Page (You will need a Wharton account. If you don't have one, go to http://accounts.wharton.upenn.edu.)
Syllabus
Weka - You'll need to download this software to complete course assignments.
Human Subjects Disclosure: The completion of some of the assignments in this course may result in data of value for research on data mining/machine learning. If the data generated in the class are used in research, no information will be revealed about the identities of individuals or about the specific intellectual content of student work.
Outside Resources
data mining textbooks
datasets
News and Announcements
Optional: Sign up for Lunch with Shawndra Hill at POD
Mondays March 17, April 7, 14 and Wednesdays April 9, 16
Departing Huntsman Hall Walnut Street Lobby at 11:40am sharp (Those who have class are welcome to join at 12)
Session Outline
Wed January 16
S1: Introduction to the Course
Required Reading:
How Verizon Cut Customer Churn, Das M., Financial Express, 10-2003
Reference Reading: (not required before class)
Chapters 1-2
Mining Business Databases, Brachman, R.J., Khabaza, T., Klosgen, W., Piatetsky-Shapiro, G. and Simoudis, E., Communications of the ACM, 1996, 39:11, pp. 42-48
12 IT Skills That Employers Can’t Say No To, Brandel, M., Computerworld, 7-11-2007
Assignments:
Out:
Homework Assignment 1
Personal DM profile (Save to your computer rename with your last name first and upload to the profiles directory)
Wed January 23
S2: Introduction to Data Mining
Required Reading:
Chapters 1-2
Reference Reading:
A Golden Vein, The Economist, 1-04
Network-based Marketing: Identifying Likely Adopters via Consumer Networks, Hill, S., Provost, F., Volinsky, C., Statistical Science, 2006, 21:2, pp. 256-276
Assignments:
In:
Homework Assignment 1
Personal DM profile
Out:
Homework Assignment 2
Anonymous Feedback on Proposed Research Ideas
Mon January 28
S3: Introduction to Decision Trees
Required Reading:
Chapters 3, 6 (pp 165 - 194)
Reference Reading:
Our Technology And Data, Farecast article
How To Buy Data Mining: A Framework For Avoiding Costly Project Pitfalls In Predictive Analytics, King, E.A., DM Review, October 2005
An Insurance Policy For Low Airfares, Tedeschi, B., NY Times, January 22, 2007
Wed January 30
S4: Decision Trees Continued
Required Reading:
Chapter 6 (pp 165 - 194)
Reference Reading:
Joined-up thinking, The Economist, Apr 4th 2007
Taking Retailers' Cues, Harrah's Taps Into the Science of Gambling, WSJ, 11-22-2004
Assignments:
In:
Homework Assignment 2
Anonymous Feedback on Proposed Research Ideas
Out:
Homework Assignment 3
Mon February 4
S5: Evaluation in Machine Learning
Required Reading
Chapter 4 (pp 95-108)
Crafting Papers on Machine Learning, P. Langley
The Case Against Accuracy Estimation for Comparing Classifiers, Provost, F., T. Fawcett, and R. Kohavi, In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98).
Wed February 6
S6: Cost Sensitive Learning
Required Reading
Chapter 4 (pp 95-108)
The Relationship Between Default Prediction And Lending Profits: Integrating ROC Analysis And Loan Pricing, Stein, R., Journal of Banking & Finance, 29 (2005) 1213-1236
Assignments:
In:
Homework Assignment 3 (Extended to Friday Feb 8)
Out:
2 Page Proposal for Group DM Project
Mon February 11
S7: Naive Bayes
Required Reading:
Chapter 8: pp.257-271
Reference Reading:
Learning and Evaluating Classifiers under Sample Selection Bias, Zadrozny, B.
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Domingos, P. and Pazzani, M., Machine Learning, 29, 103-130, 1997
What You Need To Know About Bayesian Spam Filtering, Tschabitscher, H.
A Plan For Spam, Zdziarski, J.
The State Of Spam, A monthly Report, Generated by Symantec Messaging and Web Security, February 2007
Spam And The Ongoing Battle For The Inbox, Goodman, J., G.V. Cormack, and D. Heckerman, Communications of the ACM, February 2007, Vol. 50, No. 2, pp. 25-33
Wed February 13
S8: Association Rules, KNN, Clustering
Required Reading
Chapter 9: Pages 287-315
Chapter 8: pp 257 - 271
Chapter 11: 349-365
Assignments:
In:
2 Page Proposal for Group DM Project
Out:
Sign up for 20-30 minute consulting time to discuss your project
Mon February 18
S9: Weka Demo (MEETING IN LAB HH 380)
Reference Reading:
Weka Tutorial
An Intelligent Assistant for the Knowledge Discovery Process: An Ontology-based Approach, Bernstein, A., Provost, F., Hill, S. IEEE Transactions on Knowledge and Data Engineering 17(4), pp. 503-518, 2005. (PDF)
Assignments:
Out:
Homework Assignment 4
Wed February 20
S10: Genetic Algorithms
Required Reading
Chapter 13
Reference Reading:
Discovering Interesting Patterns For Investment Decision Making With GLOWER – A Genetic Learner Overlaid With Entropy Reduction, Dhar, V., D. Chou, and F. Provost, Data Mining and Knowledge Discovery, Vol. 4, No. 4, October 2000
Assignments:
Out:
First of two DM competition datasets (Friday Feb 22)
Mon February 25
S11: Neural Networks
Required Reading:
Chapter 7
Reference Reading:
The Ultimate Money Machine, Kelley, J., Bloomberg Markets, June 2007 (A must read :))
Assignments:
In:
Homework Assignment 4
Out:
Wed February 27
S12: Data Mining for Business Intelligence
Required Reading
Reference Reading:
Assignments:
In: Group Presentations in PPT form (by Saturday March 1 10am)
Mon March 3
S13: Group Presentations
Wed March 5
S14: Group Presentations
Assignments:
Out: Have a Great Spring Break!
Mon March 17
S15: Weka Lab (MEETING IN LAB HH 380)
Reference Reading:
Weka Tutorial
Assignments:
Out:
Homework Assignment 5
Wed March 19
S16: Relational Learning P1
(come with an open mind)
Recommended Reading
Social Graph-iti, Oct 18th 2007
On Facebook, Scholars Link Up With Data, Stephanie Rosenbloom, December 17, 2007
Friend Accepted, The Economist, Oct 25th 2007
Six Degrees of Messaging, NatureNews, Katharine Sanderson, March 13, 2008
Mon March 24
S17: Relational Learning P2
Required Reading
The New Focus Groups: Online Networks, Emily Steel, WSJ, January 14, 2008
Fun List of Social Networking Sites from Mashable
Data Mining: Staking A Claim On Your Privacy, Cavoukian, A., Ph.D., Commissioner, Information and Privacy Commissioner/Ontario, January 1998, pp. i, ii, iii, 1-22
Fair Information Practice Principles
Online Ads vs. Privacy, Dan Mitchell, NY Times, May 12, 2007
Big Brother Just Wants to Help, The Economist, March 8, 2007
Recommended Reading
Privacy Preserving Data Mining, Rakesh Agrawal, et al., ACM SIGMOD International Conference on Management of Data (SIGMOD), 2000.
k-Anonymity: a model for protecting privacy, Latanya Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
Mondrian Multidimensional K-Anonymity, Kristen LeFevre, et al., IEEE International Conference on Data Engineering, 2006.
Wed March 26
S18: Harrah's Case
Required Reading
Revisit Chapters 1-2
Assignments:
In: Homework Assignment 5
Monday March 31
S19: Recommendation Systems/Collaborative Filtering
Recommended Reading
Amazon.com Recommendations: Item-to-Item Collaborative Filtering, Linden, G., B. Smith, & J. York, IEEE Computer Society, IEEE Internet Computing, Jan./Feb. 2003, pp. 76-80
Speaking out: Amazon.com's Jeff Bezos, The McGraw-Hill Companies, BusinessWeek Online, August 25, 2003
Netflix Prize Still Awaits a Movie Seer, Katie Hafner, NY Times, June 4, 2007
You Want Innovation? Offer A Prize, Leonhardt, D., NY Times, Economix section, January 31, 2007
MySpace to Discuss Effort to Customize Ads, Brad Stone, NY Times, September 18, 2007
Wed April 2
S20: Weka Lab for DM Competition (MEETING IN LAB HH 380)
Monday April 7
S21: Guest Speaker: Claudia Perlich, IBM Research
Claudia Perlich received her M.Sc. in Computer Science from the University of Colorado at Boulder, her Diplom in Computer Science from Technische Universitaet in Darmstadt, and her Ph.D. in Information Systems from the Stern School of Business, New York University. Her Ph.D. thesis concentrated on probability estimation in multi-relational domains that capture information about multiple entity types and the relationships between them. Her dissertation was recognized as an additional winner of the International SAP Doctoral Support Award Competition, and her submission placed second in the yearly data mining competition in 2003 (KDD-Cup 03).
Claudia joined the Data Analytics Research group as a Research Staff Member in October 2004. She interned during summer 1999 at Deep Computing for Commerce Research Group under Murray Campbell working on financial trading behavior on Treasury Bonds. Her research interests are in machine learning for complex real-world domains and the comparative study of model performance as a function of domain characteristics.
Required Reading
Making the Most of Your Data: KDD Cup 2007 “How Many Ratings” Winner’s Report, S. Rosset, C. Perlich, Y. Liu
Wednesday April 9
S22: Guest Speaker: Robert Bell, AT&T Labs Research
Robert Bell has been a member of the Statistics Research Department at AT&T Labs-Research since 1998. He previously worked at RAND doing public policy analysis. His current research interests include machine learning methods, analysis of data from complex samples, and record linkage methods. He has served on several National Research Council panels advising the Census Bureau and chairs a current panel on coverage measurement for the 2010 census. He is currently a member of the board of the National Institute of Statistical Sciences and was recently a member of the Committee on National Statistics and chair of the Fellows Committee of the American Statistical Association.
Required Reading
http://www.wired.com/techbiz/media/magazine/16-03/mf_netflix
http://stat-computing.org/newsletter/v182.pdf (pp. 4-12)
Recommended Reading
http://www.research.att.com/~volinsky/netflix/
Monday April 14
S23: (MEETING IN LAB HH 380/OPTIONAL!)
Wednesday April 16
S24: Guest Speaker: Daryl Pregibon, Google, Inc.
Daryl Pregibon is a research scientist at Google, Inc. He is a recognized leader in data mining, the interdisciplinary field that combines statistics, artificial intelligence, and database research. His research interests include analysis of massive data sets, statistical computing, generalized linear models, tree-based methods, and regression diagnostics. During his career, Dr. Pregibon has nurtured successful interactions in fiber and microelectronics manufacturing, network reliability, customer satisfaction, fraud detection, targeted marketing, and regulatory statistics. Over these years, his research contributions changed from mathematical statistics to computational statistics and included such topics as expert systems for data analysis, data visualization, application-specific data structures for statistics, and large-scale data analysis. From 1989 to 2004, he worked at AT&T and served as head of statistics research. He is currently a member of the NAS Committee on National Statistics and the NAS Study Committee on Ballistics, and is a former chair of the NAS Committee on Applied & Theoretical Statistics. He has also held positions on the National Advisory Committee for the Statistical and Applied Mathematical Sciences Institute (SAMSI), Research Triangle Park, and is a director of the Association for Computing Machinery (ACM) Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Other previous academic and professional experiences include: associate editor of Data Mining & Knowledge Discovery; associate editor, Statistics & Computing; and co-founder of the Society for Artificial Intelligence & Statistics (SAIAS). He has authored more than 60 publications and holds four patents. Dr. Pregibon received his Ph.D. in statistics from the University of Toronto and his M.A. in mathematics from Youngstown State University (source: The National Academies).
Mon April 21
S25: Guest Speaker: Steven L. Scott, Capital One
Steven L. Scott received his PhD from the Harvard statistics department in 1998. From 1998 to 2007 he served on the faculty of the Marshall School of Business at the University of Southern California. Dr. Scott's research focuses on applied Bayesian computation in a diverse set of fields including web traffic modeling, e-commerce, network security, health policy research, and educational testing. Several of his papers have appeared in the Journal of the American Statistical Association, the premier journal in the field of statistics. He has had consulting relationships with several companies ranging from AT&T-Bell Labs, to a psychic hotline, to the McKinsey Corporation. In June of 2007 Dr. Scott left USC to join Capital One, where he now serves as a Director of Statistical Analysis.
Recommended Reading
Competing on Analytics: The New Science of Winning, T. Davenport, J. Harris
Wed April 23
S26: Group Presentations
Monday April 28
S27: Group Presentations/Data Mining Competition Winner Announced